1  Research  Overview


The research vehicle for this contract is the largest possible computer that could be conceived for the mid
to late 1990s. The technical challenges of such a machine serve as our guiding stimulus for the research
carried out and reported here.

We imagine this machine to occupy a 14-story building, to cost upward of $1 billion, and to be so
colossal  that  the  nation  could  only  afford  one  or  two  of  them.  The  available  chip  technology  and  machine 
size are consistent with 101’ FLOPS and 10^15 bytes of memory. The machine will be used to solve
large-scale  scientific  problems  having  both  military  and  civilian  applications. 

This investigation addresses the hardware technology, software techniques, algorithms, communications,
processing elements, and applications. The study will determine the plausibility (not feasibility) of
the machine. Progress in these areas is highlighted in the following sections.


2  Circuits 

Sandy Wells and Tom Knight have designed and tested MSI prototypes of a new class of analog computing
devices, based on switched capacitor constraint boxes. The core of these devices is a two-port consisting
of a capacitor rapidly switched between the ports. Labelling the terminal voltages a, b, c, d, this attempts
to enforce a constraint a - b = c - d. This is a reciprocal constraint, allowing propagation of information
in either direction. We have shown that, using this basic constraint box, we can solve linear systems (to
arbitrary accuracy using mixed analog/digital techniques), solve over-constrained systems with the
pseudo-inverse, and solve linear programming problems. The small size, simplicity, and ease of understanding
argue that this device may be an important circuit element in next-generation hybrid computing.

Srinivas Devadas and his students have been focusing on the optimization of combinational and sequential
circuits specified at the register-transfer or logic levels, using the area, performance, and testability of
the synthesized circuit as design parameters. Work is also being done in the area of test generation for
VLSI circuits.

Techniques have been proposed in the past for various types of finite state machine (FSM) decomposition
that use the number of states or edges in the decomposed circuits as the cost function to be optimized.
These measures do not reflect the true logic complexity of the decomposed circuits. These methods
have been mainly heuristic in nature and offer limited guarantees as to the quality of the decomposition.
In this work [32], following up on our exact state assignment algorithm developed earlier [31], we have
developed optimum and heuristic algorithms for the general decomposition of FSMs such that the total
number of product terms in the one-hot coded and logic-minimized submachines is minimum
or minimal. This cost function is much more reflective of the area of an optimally state-assigned and
minimized submachine than the number of states/edges in the submachine.

We are continuing to investigate the impact of logic synthesis on the testability of sequential circuits
that can be modeled as finite state machines [33] [34] [37] [30]. The new approach of [34] and [37] is
to use synthesis to ensure the complete testability of a sequential circuit by ensuring that each invalid
state has an unperturbable distinguishing sequence. To accomplish this we have developed a Boolean
minimization procedure of prime implicant generation and constrained covering, based on the
Quine-McCluskey algorithm, that ensures that no single fault can both produce an invalid state and corrupt the
distinguishing sequence by which that invalid state can be identified. On completion, it guarantees a prime
and irredundant, fully testable Moore or Mealy finite state machine. Given a two-level circuit with these
properties, we then define constrained algebraic factorization techniques that retain the invariant that no
single fault can both produce an invalid state and corrupt the distinguishing sequence by which that invalid
state is detected. We have used the notion of fault-effect disjointness to explore the landscape between
various synthesis approaches and have demonstrated a spectrum of methods [37] that place relatively more
or less emphasis on either logic optimization or constrained synthesis. Techniques used in this exploration
include fault simulation, Boolean covering, algebraic factorization, and state assignment.
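
As a point of reference for the minimization procedure described above, the following is a minimal sketch of
the prime implicant generation step of the textbook Quine-McCluskey algorithm. It is not the constrained
procedure of [34] and [37]: the constrained covering step and the fault-related conditions are omitted, and the
function names and the string encoding of cubes ('0', '1', '-') are assumptions made for illustration.

```python
from itertools import combinations

def combine(a, b):
    """Merge two cubes that differ in exactly one specified bit, else None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms, nbits):
    """Return the prime implicants of the ON-set as strings over '0', '1', '-'."""
    current = {format(m, '0{}b'.format(nbits)) for m in minterms}
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        # Cubes that could not be merged any further are prime.
        primes |= current - used
        current = merged
    return primes
```

For example, prime_implicants([0, 1, 2, 3], 2) collapses the four minterms into the single cube '--'. A
covering step would then select a subset of the primes; in the work described above that selection is
additionally constrained by the testability conditions.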

We have explored the relationships between redundant logic and don't care conditions in sequential
circuits [30]. Stuck-at faults in a sequential circuit may be testable in the combinational sense, but may
be redundant because they do not alter the terminal behavior of a non-scan sequential machine. These
sequential redundancies result in a faulty State Transition Graph (STG) that is equivalent to the STG of
the true machine. We have precisely classified redundant faults in sequential circuits composed of single
or interacting finite state machines. For each of the different classes of redundancies, we define don't care
sets which, if optimally exploited, will result in the implicit elimination of any such redundancies in a given
circuit.

We have also addressed the problem of generating test sequences for stuck-at faults in non-scan synchronous
sequential circuits [38]. A novel test procedure that exploits both the structure of the combinational
logic in the circuit and the sequential behavior of the circuit has been developed. In contrast
to previous approaches, we decompose the problem of sequential test generation into three subproblems:
combinational test generation, fault-free state justification, and fault-free state differentiation. Initially,
prior to test generation, separate sum-of-product representations of the complete or partial ON-sets and
OFF-sets of each of the flip-flop inputs and primary outputs of the sequential circuit are extracted using
the PODEM algorithm. Fast algorithms for state justification and state differentiation can be based on
this representation. These algorithms perform repeated cube intersections in an effort to find a justification
sequence for a state or a distinguishing sequence for a pair of states.
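
The cube intersection primitive mentioned above can be sketched as follows. This is only an illustration under
assumed conventions: cubes are strings over '0', '1', and '-' (don't care), and the transition list passed to
one_step_justify is a hypothetical stand-in for the extracted sum-of-product representation, not the data
structure used in [38].

```python
def intersect(c1, c2):
    """Intersect two cubes over '0', '1', '-'; return None if they are disjoint."""
    out = []
    for x, y in zip(c1, c2):
        if x == '-':
            out.append(y)
        elif y == '-' or x == y:
            out.append(x)
        else:
            return None  # conflicting literals: no common minterm
    return ''.join(out)

def one_step_justify(target_state, transitions):
    """Return the (input, present-state) cubes that reach target_state in one clock.

    transitions is a list of (input_cube, present_state_cube, next_state) triples
    (an assumed format for this sketch).
    """
    return [(inp, ps) for inp, ps, ns in transitions
            if intersect(ns, target_state) is not None]
```

Repeating one_step_justify backwards from a target state, and intersecting the resulting present-state cubes
with the reset state, is one simple way a justification sequence could be searched for.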

We have addressed the problem of generating tests for delay faults in non-scan synchronous sequential
circuits [36]. Delay test generation for sequential circuits is a considerably more difficult problem than
delay testing of combinational circuits and has received much less attention. We have developed a method
for generating test sequences to detect delay faults in sequential circuits using a stuck-at fault sequential
test generator. The method is complete in that it will generate a delay test sequence for a targeted fault
given sufficient CPU time, if such a sequence exists. We term faults for which no delay test sequence exists,
under our test methodology, sequentially delay redundant. We have also developed means of eliminating
sequential delay redundancies in logic circuits.

Finally, we have done some preliminary work in an attempt to gain insight into the nature of NP-complete
problems. In [35], we have transformed various NP-complete problems in layout, namely two-layer and
multi-layer dogleg channel routing, two-way partitioning, and one-dimensional and two-dimensional placement,
into Boolean satisfiability problems. The transformations are efficient in that the number of inputs to the
Boolean function for which we have to find a satisfying assignment grows only linearly or quasi-linearly
with the layout problem size. We have applied to these layout problems sophisticated test generation and
logic verification strategies that can be used to check for Boolean function satisfiability. It appears that this
approach to layout optimization offers an elegant means of representing and searching the entire space of
feasible solutions in an attempt to optimize a complex cost function with associated constraints.


3  Processing  Elements 

The processors of a multicomputer require the ability to switch tasks rapidly to hide transmission latency
without sacrificing single-thread performance. Peter Nuth and Bill Dally are working on an architecture
for  a  named  state  processor  that  achieves  this  goal  by  explicitly  binding  names  to  all  processor  registers 
and  interleaving  tasks  on  a  microcycle  basis.  This  mechanism  combines  the  advantages  of  multi-threading 
and  multiple  register  sets  for  implementing  fast  context  switches  and  procedure  calls.  It  also  provides  a 
general  synchronization  mechanism. 

During the past year, we have defined the named state processor architecture and its interface to a
multicomputer network. We are currently studying instruction scheduling policies (deciding which
processes' instructions are advanced, and when) and context cache management policies (deciding which processes'
state remains in active storage). A simulator for the processor is under construction. This work is being
performed by Peter Nuth as his MIT Ph.D. thesis.

Most multicomputers are specialized to execute a single model of computation (e.g., dataflow, actors,
or shared memory). Scott Wills and Bill Dally have identified a set of primitive mechanisms for
communication, synchronization, and naming that are required for all of these models of computation. We
are currently analyzing these mechanisms in terms of their implementation cost and their suitability for
supporting popular models of parallel computation [51] [55].

During the reporting period, we have defined a parallel machine interface that incorporates a consistent
set of these mechanisms. A parallel interface simulator, PiSIM, has been constructed to facilitate
experiments with the interface. Using PiSIM, dataflow and shared memory models of computation have been
implemented on the parallel machine interface. We are presently evaluating the cost and performance of
these implementations.

Anant Agarwal has investigated the use of rapid-context-switching VLSI RISC processors as the
computing nodes in a large parallel machine. Rapid context switching allows overlapping communication and
synchronization delays with computation by quickly scheduling a new process on the processor. The
design of such a processor is complete. The processor, APRIL, switches between threads either on memory
accesses to remote nodes or on an unsuccessful access of a synchronization object. APRIL has tag
support for futures, and synchronization support in the form of full-empty bits associated with each
memory word. APRIL also has several basic instructions to allow experimentation with a variety of shared
memory programming models. These special operations include cache flushes, fences, block transfers,
and a user-definable choice of spin-waiting versus blocking. An instruction-level simulator for APRIL has
been written. A Mul-T compiler for this processor has been written and generates code that runs on the
simulator. A scheduler that exploits the multithreaded nature of the processor, and other run-time system
software, has also been written and runs on the simulator. An implementation design consisting of very
minor modifications to the SPARC processor is almost complete. Because floating point operations are
usually supported through coprocessors in most modern VLSI RISC microprocessors, we are
investigating methods of multithreading a coprocessor. A performance evaluation of the system effects of
multithreaded processors has also been completed [59]. The analytical evaluation considered the context
switching overhead and the increased cache and network contention. We showed that for most system
configurations, while providing for network, cache and overhead effects, between two and four contexts
were  sufficient  to  provide  close  to  DOeffects. 

We are designing a scalable cache and memory system. A detailed protocol design for a scalable cache
coherence scheme is complete and has been implemented in a simulator. A cache controller design is in
progress, and a VLSI implementation of it is envisaged in the near future. The architectural and
VLSI circuit design of a fast and low-storage-overhead translation scheme for processor addresses is in
progress. Simulations of various cache coherence schemes, such as limited directories, singly and doubly
linked lists, and write-through shared, are in progress. Our simulations use traces from numeric FORTRAN
codes, graph algorithms written in Mul-T, and CAD applications written in C. (Our FORTRAN traces
were obtained through a joint effort with IBM T. J. Watson Research Center. The Mul-T traces were
obtained through a compiler-aided tracing package we wrote called T-Mul-T. We have also made these traces
available to other researchers. The CAD traces are from Stanford.) Initial results indicate that
the performance of singly linked lists is comparable to that of doubly linked lists without the extra hardware
overhead and complexity. Limited directories are shown to perform comparably if software support for
widely-shared read-only objects and synchronization structures is provided. We wrote a novel post-mortem
scheduler that can take a single-processor execution of a parallel program and, simulating the effect of
various synchronization implementations such as adaptive backoff [5] or software barrier trees, produce
cache statistics for the various synchronization implementations [39].


4  Communications  Topology  and  Routing  Algorithms 

Bill Dally and his students are experimenting with a new flow control strategy based on virtual channels.
Our initial results show that this strategy can boost network throughput to 90% capacity without adaptive
routing by decoupling resource constraints. Current flow control methods are limited to 30% to oQ%
capacity because many channels remain idle due to resource allocation coupling. This throughput limit is
not due to load imbalance, which can only be addressed by adaptive routing.

The virtual channel flow control method divides a channel's flit buffers into many shallow 'lanes', rather
than a single deep FIFO. The buffering is short and wide rather than long and fat. The organization
decouples flit buffer resource allocation for each channel. This allows active messages to pass blocked
messages that are waiting on an unrelated resource, much in the way that a two-lane street permits cars
travelling straight ahead to pass a car that is waiting to make a left turn.
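
The passing behavior described above can be shown with a toy model. This is a minimal sketch rather than the
simulator used in this work: it operates on whole messages instead of flits, forwards at most one message per
cycle, and treats the set of blocked messages as fixed; the function name and arguments are assumptions.

```python
from collections import deque

def drain(lanes, blocked, cycles):
    """Forward at most one message per cycle over a single physical channel.

    lanes:   list of lists of message names, one list per lane buffer
    blocked: set of messages waiting on an unrelated resource
    Returns the order in which messages were forwarded.
    """
    lanes = [deque(lane) for lane in lanes]
    delivered = []
    for _ in range(cycles):
        for lane in lanes:
            # Only the head of a lane may advance, and only if it is not blocked.
            if lane and lane[0] not in blocked:
                delivered.append(lane.popleft())
                break
    return delivered
```

With a single deep FIFO, drain([['A', 'B', 'C']], {'A'}, 3) forwards nothing because the blocked message A
holds up B and C. With the same buffer space split into two lanes, drain([['A'], ['B', 'C']], {'A'}, 3)
forwards B and then C while A waits.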

We have built a simulator of direct and indirect networks that use virtual channel flow control and
have measured their performance under different loads and traffic patterns. The initial results suggest that
a moderate number of virtual channels (4-8) gives a throughput that is very close to network capacity.
The remaining degradation is largely due to load imbalance, and adaptive routing will be required to reach
100% capacity.

Express cubes are k-ary n-cube interconnection networks augmented by express channels that provide
a short path for non-local messages. An express cube combines the logarithmic diameter of an indirect
network with the wire-efficiency and ability to exploit locality of a direct network. The insertion of express
channels reduces the network diameter and thus the distance component of network latency. Wire length is
increased, allowing networks to operate with latencies that approach the physical speed-of-light limitation
rather than being limited by node delays. Express channels increase wire bisection in a manner that
allows the bisection to be controlled independent of the choice of radix, dimension, and channel width.
By increasing wire bisection to saturate the available wiring media, throughput can be substantially
increased. With an express cube, both latency and throughput are wire-limited and within a small factor
of the physical limit on performance. Express channels may be inserted into existing interconnection
networks using interchanges. No changes to the local communication controllers are required.
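
The reduction in the distance component of latency can be illustrated on a single dimension. The sketch below
is a simplification and rests on several assumptions: the dimension is a linear array of k nodes without
wraparound, interchanges are modeled as sitting at every node whose index is a multiple of the hypothetical
parameter interval, and each express channel joins adjacent interchanges. Hop counts are found by
breadth-first search so that no routing policy has to be assumed.

```python
from collections import deque

def hops(k, src, dst, interval=None):
    """Minimum hop count from src to dst on a k-node linear array.

    If interval is given, nodes at multiples of interval also have express
    channels to the nodes interval positions away in either direction.
    """
    def neighbors(n):
        steps = [1, -1]
        if interval and n % interval == 0:
            steps += [interval, -interval]
        return [n + s for s in steps if 0 <= n + s < k]

    dist = {src: 0}
    queue = deque([src])
    while queue:
        n = queue.popleft()
        if n == dst:
            return dist[n]
        for m in neighbors(n):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return None  # unreachable (cannot happen for valid src and dst)
```

On a 64-node array, hops(64, 0, 63) is 63, while hops(64, 0, 63, interval=8) is 14: seven express hops
followed by seven local hops.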

Tom Knight and his students are continuing implementation work on the Transit communication
switch. We have released to manufacturing the design for the button board connector, and for the PC
board component carrier. The carrier cooling technology has evolved somewhat since our last report
as  a  result  of  detailed  heat  flow  calculations.  Our  current  approach  involves  flowing  coolant  through  a 
microchannel  heatsink  bonded  directly  to  the  rear  surface  of  the  die,  similar  to  the  approach  used  by 
Tuckerman  at  Stanford,  but  at  a  more  macroscopic  level. 

Die  design  continues,  with  the  gate-level  description  and  stable  test-patterns,  and  with  initial  sizing 
and  layout  work  under  way.  Initial  RSIM  estimates  by  Henry  Minsky  of  timing  (now  at  17ns)  indicate 
that substantial additional effort will be required to achieve our target of a 10ns clock rate, but we remain
cautiously  optimistic. 

Recent  design  changes  in  the  chip  specification,  adding  a  per-input-port  "swallow"  signal,  allow  the 
use  of  this  design  in  combination  with  some  as-yet  missing  packaging  technology  to  construct  much  larger 
switching  arrays  based  on  Leiserson's  fat-tree  topology.  Andre  DeHon  is  actively  pursuing  the  topological, 
packaging,  and  electrical  requirements  of  this  expansion. 

Alex  Ishii  is  incorporating  recent  shifts  from  voltage  control  of  the  pad  output  impedance  to  a  scheme 
utilizing digitally controlled D/A networks for implementing the controlled impedance pullup and pulldown
devices. 

We  have  located  commercial  suppliers  for  closed  loop  Fluorinert  cooling  systems,  and  plan  to  purchase 
this component when it appears to be the pacing item in the design. High efficiency low voltage power
supplies  remain  a  difficult  issue,  but  interim  low-efficiency  designs  will  allow  us  to  test  the  remainder  of 
the  system,  while  determining  more  efficient  systems. 

Network design for large-scale machines was investigated by Anant Agarwal and his students. We
showed that when switch delay is included in the analysis of direct interconnection networks, the network
implemented in two physical dimensions that is optimal in terms of latency is three dimensional. This is in
contrast to previous findings that showed that a two dimensional network was optimal. The chief reason
for the difference is that node delays can make the wire delays have a relatively smaller impact on overall
latency. A detailed performance model for circuit-switched interconnection networks was developed [60].
Simulators for circuit-switched and packet-switched indirect networks are operational, and we now also
have a packet-switched direct network simulator.


5  Systems  Software 

Andrew Chien and Bill Dally are developing data abstraction tools that support the development of
programs for large scale multicomputers. A language, concurrent aggregates, has been defined that facilitates
the specification of aggregates of cooperating objects. Concurrent aggregates permit the relationships
between objects to be defined textually rather than requiring that the objects connect up a pointer
structure at run-time as is typically done. Common structures (e.g., combining trees) can be defined once and
reused as required. The language also permits nesting of object aggregates and specialization of objects
within the aggregate. This work is being performed by Andrew Chien for his MIT Ph.D. thesis.

During the reporting period, the concurrent aggregates (CA) language has been defined. A compiler
that translates CA programs to C++ has been written. The output of this compiler is linked with a
run-time written in C++ that simulates parallel machine execution. A number of programs have been
written in CA to evaluate the language. A study of the efficiency of the language and its implementation
is currently underway.

Bill  Dally  and  Lucien  Van  Elsen  have  developed  a  technique,  micro-optimization,  for  reducing  the 
operation count and time required to perform numerical calculations. The method involves first breaking
floating  point  operations  into  their  constituent  integer  micro-operations,  then  optimizing  and  scheduling 
the  resulting  integer  code.  The  method  has  been  tested  using  a  prototype  expression  compiler  [54].  We 
are  now  looking  at  extending  the  method  to  permit  a  compiler  to  perform  automatic  scaling  of  numbers. 
Where  it  is  possible,  this  optimization  would  convert  floating  point  expressions  into  integer  expressions. 
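
To make the idea of micro-optimization concrete, the sketch below breaks a floating point multiply into integer
micro-operations and then removes a redundant normalization from a chain of multiplies. It is an illustration
only, not the expression compiler of [54]: the (mantissa, exponent) pair representation, the 24-bit mantissa
width, the restriction to non-negative mantissas, and all function names are assumptions.

```python
MANT_BITS = 24  # hypothetical mantissa width for this sketch

def normalize(m, e):
    """Shift a non-negative mantissa right until it fits in MANT_BITS (truncating)."""
    while m >= 1 << MANT_BITS:
        m >>= 1
        e += 1
    return m, e

def fmul(a, b):
    """One floating point multiply expressed as integer micro-operations."""
    m = a[0] * b[0]  # integer mantissa multiply
    e = a[1] + b[1]  # integer exponent add
    return normalize(m, e)

def fmul3_fused(a, b, c):
    """a * b * c with the intermediate normalization removed."""
    m = a[0] * b[0] * c[0]
    e = a[1] + b[1] + c[1]
    return normalize(m, e)

def value(x):
    """Real value represented by a (mantissa, exponent) pair."""
    return x[0] * 2.0 ** x[1]
```

For a = (3, 0), b = (5, 1), c = (7, -2), both fmul(fmul(a, b), c) and fmul3_fused(a, b, c) give (105, -1),
which represents 52.5. When intermediate mantissas overflow MANT_BITS the two forms may round differently, so a
real compiler would have to account for the required precision before applying such a rewrite.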

John  Keen  and  Bill  Dally  have  been  investigating  several  problems  involved  in  constructing  highly 
concurrent  database  systems  on  concurrent  computers  augmented  by  large  disk  arrays.  The  goal  is  to 
develop  systems  technology  that  will  permit  database  systems  based  on  concurrent  computers  to  handle 
I  IT’  transactions  per  second.  To  date  we  have  concentrated  on  parallel  algorithms  for  logging,  recovery,  and 
consistency control. The parallel logging and recovery algorithms make use of parallel logs that represent
a partial order of actions and the use of log processors to compress the logs on a regular basis. We are
investigating consistency control algorithms that use reservations to achieve a higher degree of concurrency
than  is  possible  using  locks. 

Anant  Agarwal  has  continued  explorations  of  methods  of  programming  a  large-scale  parallel  computer 
such  as  the  ARC.  These  investigations  take  two  forms.  First,  we  are  looking  at  methods  of  partitioning 
and  scheduling  parallel  programs  to  minimize  communications.  Numerical  algorithms  that  can  exploit 
locality  are  being  investigated.  Tradeoffs  in  the  use  of  block  techniques  for  linear  algebraic  codes  are  being 
studied.  We  currently  have  several  parallel  address  traces  of  several  runs  of  parallel  blocking  methods 
and  we  are  studying  their  impact  on  cache  and  network  performance.  Scheduling  methods  that  exploit 
both  locality  and  the  communication  latency  hiding,  provided  by  a  rapid  context-switching  processor, 
are  being  investigated.  Our  experimental  scheduler  runs  on  our  simulation  system.  Our  second  thrust  is 
towards enhancing our parallel programming language to allow (1) the convenient specification of data
parallelism using structures similar to the dataflow I-structures, and (2) experimentation with data
placement and relocation, function and data shipping, and different programming models including weaker
shared memory models with block transfer capabilities. Our current status is that the language primitives
have been defined as extensions to Mul-T, and their implementation in the compiler and simulator is in
progress. The APRIL compiler and linker and the lazy future kernel have been implemented. Extensions
for garbage collection and efficient floating-point support are being developed. The T language has also
been ported to the SPARC and the DECstation (Pmax).

To  gain  more  experience  with  programming  large-scale  parallel  machines  we  are  also  writing  several 
parallel applications. Our major effort has been spent on Speech. This application comprises the Viterbi
search portion of a connected speech recognition system being implemented by the Speech and Spoken
Language Systems Group at MIT. We have also written particle-in-cell in Mul-T. Other parallel
applications that we have written include logic simulation and permute. The Simple application is also
partially written in Mul-T.

Several  performance  evaluation  tools  and  methods  have  been  developed.  Our  T-Mul-T  multiprocessor 
address  tracer  is  operational.  We  developed  a  technique  for  trace  compaction  that  exploits  the  spatial 
locality  of  memory  referencing  in  multiprocessors  [61].  A  novel  model  for  multithreaded  processors  has 
also been derived. A processor locality-based multiprocessor cache interference model has been developed
[58].

System  studies  putting  all  the  above  pieces  together  are  also  in  progress.  A  detailed  multiprocessor 
simulator has been implemented and is functional. The simulator comprises the APRIL processor
simulator, the cache and memory system, and the interconnection network. Parallel applications written
in Mul-T are compiled to APRIL code and can be executed on the multiprocessor simulator. We have
successfully run our large speech application on 16 processors, each with a multithreaded degree of four. If
needed, the FORTRAN post-mortem scheduler or T-Mul-T tracer can replace the APRIL processor front
end.


6  Algorithms

In the area of algorithms, three students (Ron Greenberg, Bruce Maggs, and Cindy Phillips) finished
their Ph.D. theses under the direction of Charles Leiserson.

Ronald Greenberg has completed his Ph.D. thesis entitled "Efficient Interconnection Schemes for VLSI
and Parallel Computation." The thesis is primarily concerned with the design of efficient interconnection
networks for general purpose parallel computers and the more specialized problem of multilayer
channel routing for VLSI chips. In addition, it provides lower bounds on the area required for VLSI
implementations of finite-state machines.

The  first  part  of  Greenberg's  thesis  shows  why  networks  based  on  Leiserson's  fat-tree  architecture  are 
nearly  as  good  as  any  network  built  in  a  comparable  amount  of  physical  space.  Such  networks  can  simulate 
any  other  network  of  the  same  area  with  slowdown  which  is  a  small  polylogarithmic  function  of  the  area. 
These  "universal"  networks  can  be  constructed  in  area  linear  in  the  number  of  processors,  so  that  there  is 
no  need  to  restrict  the  density  of  processors  in  competing  networks.  Also  it  is  possible  to  compare  networks 
that  are  of  different  size  or  are  built  from  processors  of  different  sizes  (as  determined  by  the  amount  of 
attached  memory).  In  addition,  many  of  the  results  given  do  not  require  the  usual  assumption  of  unit 
wire  delay.  Also,  it  is  possible  to  simulate  competing  networks  even  if  the  processors  are  not  globally 
synchronized  into  separate  phases  of  internal  computation  and  interprocessor  communication.  Finally,  the 
results  apply  not  only  in  two  dimensions,  but  also  in  three  dimensions  by  way  of  a  simple  demonstration 
of  general  results  on  graph  layout  in  three  dimensions.  This  part  of  the  thesis  includes  joint  work  with 
Charles  Leiserson  of  MIT. 

The  second  part  of  Greenberg's  thesis  discusses  the  channel  routing  problem  in  the  context  that  many 
layers of interconnect are available. It describes a system, MulCh, for multilayer channel routing, which
extends  the  Chameleon  system  developed  at  U.  C.  Berkeley.  Like  Chameleon,  MulCh  divides  a  multilayer 
problem  into  essentially  independent  subproblems  of  at  most  three  layers,  but  unlike  Chameleon,  MulCh 
considers  the  possibility  of  using  partitions  comprised  of  a  single  layer  instead  of  only  partitions  of  two  or 
three  layers.  Experimental  results  show  that  MulCh  often  performs  better  than  Chameleon  in  terms  of 
channel width, total net length, and number of vias. In addition to a description of MulCh as implemented,
Greenberg's thesis discusses improved algorithms for subtasks performed by MulCh, thereby indicating
potential  improvements  in  the  speed  and  performance  of  multilayer  channel  routing.  In  particular,  linear 
time  suffices  to  determine  the  minimum  width  required  for  a  single-layer  channel  routing  problem,  and 
the  density  of  a  collection  of  nets  can  be  maintained  in  logarithmic  time  per  net  insertion.  The  work  on 
MulCh  is  joint  with  Alex  Ishii  of  MIT  and  Alberto  Sangiovanni-Vincentelli  of  U.  C.  Berkeley;  the  work 
on  single-layer  channel  routing  is  joint  with  Miller  Maley  of  Princeton  U. 
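
For readers unfamiliar with the term, the density of a collection of nets is the largest number of nets that
cross any single column of the channel. The sketch below computes it with a straightforward sweep over sorted
endpoints. It is not the logarithmic-time-per-insertion structure discussed in the thesis; the representation
of a net as a (left, right) pair of column indices and the function name are assumptions.

```python
def channel_density(nets):
    """Return the maximum number of nets crossing any column.

    nets: iterable of (left, right) column spans, inclusive on both ends.
    """
    events = []
    for left, right in nets:
        events.append((left, 1))        # net starts occupying column `left`
        events.append((right + 1, -1))  # net stops after column `right`
    density = current = 0
    # At equal columns the -1 events sort first, so abutting nets do not overlap.
    for _, delta in sorted(events):
        current += delta
        density = max(density, current)
    return density
```

For example, channel_density([(1, 3), (3, 5), (6, 8)]) is 2, because the first two nets both occupy column 3.
This version recomputes the density from scratch in O(n log n) time for n nets.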

The last part of Greenberg's thesis shows that straightforward techniques for implementing finite-state
machines are optimal in the worst case. Specifically, for any s and k, there is a deterministic finite-state
machine with s states and k symbols such that any layout algorithm requires $\Omega(ks \lg s)$ area to lay out its
realization. For nondeterministic machines, there is an analogous lower bound of $\Omega(ks^2)$ area. This work
is joint with Mike Foster of Columbia University.

Bruce Maggs also finished his dissertation, entitled Locality in Parallel Computation. The thesis
explores strategies for exploiting locality in three major areas of parallel computation: packet routing, parallel
algorithm design, and network emulations.

The first part of Maggs's thesis deals with a novel network-independent approach to the packet-routing
problem. The strategy is to partition the problem into two stages: a path-selection stage and a scheduling
stage. In the first stage, paths are found for the packets with small congestion, c (the maximum number of
paths that use any one edge), and dilation, d (the length of the longest path). Once the
paths are fixed, both are lower bounds on the time required to deliver the packets. In the second stage we
find a schedule for the movement of each packet along its path so that no two packets traverse the same
edge at the same time; consequently, the total time and maximum queue size required to route all of the
packets to their destinations are minimized.
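
The two path measures can be computed directly from a set of fixed paths. The sketch below is only an
illustration: the function name congestion_and_dilation and the treatment of edges as directed pairs
are assumptions, with congestion taken as the largest number of paths using one edge and dilation as
the longest path length in edges.

```python
from collections import Counter


def congestion_and_dilation(paths):
    """Return (c, d) for a list of paths, each given as a list of nodes.

    Edges are treated as directed pairs (u, v); this is an assumption.
    """
    edge_use = Counter()
    dilation = 0
    for path in paths:
        edges = list(zip(path, path[1:]))
        dilation = max(dilation, len(edges))  # longest path, in edges
        edge_use.update(edges)                # count paths crossing each edge
    congestion = max(edge_use.values(), default=0)
    return congestion, dilation


paths = [["a", "b", "c", "d"], ["e", "b", "c"], ["f", "b", "c", "g"]]
c, d = congestion_and_dilation(paths)
# Any schedule needs at least max(c, d) steps: c packets must cross the
# busiest edge one at a time, and some packet must travel d edges.
lower_bound = max(c, d)
```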

Although path-selection strategies vary from network to network, Maggs shows that there is an efficient
on-line scheduling algorithm for the entire class of layered networks. When applied to an N-packet problem,
the algorithm produces a schedule of length $O(c + d + \log N)$, with high probability. The algorithm has
many applications to routing and sorting. Among them are the first on-line algorithms for routing N
packets on an N-node shuffle-exchange graph in $O(\log N)$ steps using constant-size queues and for routing
$kM^k$ packets on a k-dimensional array with side length M in $O(kM)$ steps using constant-size queues.
The scheduling algorithm can also be used as a subroutine in sorting algorithms. It yields the first
asymptotically optimal algorithms for sorting on butterfly, shuffle-exchange, and multidimensional array
networks using constant-size queues. The algorithm can also be applied to the construction of
area-universal networks: N-node networks with VLSI-layout area $O(N)$ that can simulate all other networks
with area $O(N)$ with only $O(\log N)$ slowdown. Maggs also proves the existence of a schedule of length
$O(c + d)$ for any set of packets whose paths have congestion c and dilation d (in any network) that uses
constant-size queues. Unfortunately, no efficient algorithm for constructing the schedule is known.

The second part of Maggs's thesis introduces a model for parallel computation, called the distributed
random-access machine (DRAM), in which the communication requirements of parallel algorithms can be
evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented
by routing messages through a communication network. It explicitly models the congestion of messages
across cuts of the network.

Maggs introduces the notion of a conservative algorithm as one whose communication requirements at
each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM.
A conservative algorithm is guaranteed not to generate undue congestion in any underlying network. Maggs
presents conservative algorithms for a variety of graph problems. Problems such as computing treewalk
numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree can
be solved in $O(\log N)$ steps for N-node trees by conservative algorithms for an exclusive-read
exclusive-write DRAM. More complex problems, including finding a minimum-cost spanning forest, computing
biconnected components, and constructing an Eulerian cycle, require $O(\log^2 N)$ steps for graphs of size
N. For concurrent-read concurrent-write DRAMs, all of these problems can be solved by $O(\log N)$-step
conservative algorithms.

The final part of the thesis examines the problem of how efficiently a host network can emulate a guest
network. The goal is to emulate $T_G$ steps of an $N_G$-node guest network on an $N_H$-node host network.
An emulation is called work-preserving if the time required by the host, $T_H$, is $O(T_G N_G / N_H)$, because
then both the guest and host networks perform the same amount of total work (processor-time product),
$\Theta(T_G N_G)$, to within a constant factor. A work-preserving emulation is efficient because it achieves optimal
speedup over a sequential emulation of the guest. An emulation is real-time if $T_H = O(T_G)$, because then
the host emulates the guest with constant delay.

Although many isolated emulation results have been proved for specific networks in the past, and
measures such as dilation and congestion were known to be important, the field has lacked a model within
which general results and meaningful lower bounds could be proved. Maggs provides such a model, along
with techniques for proving lower bounds based on comparing the locality of the networks. Some of the
more interesting and diverse results in this part of the thesis include a proof that a linear array can
emulate a (much larger) butterfly in a work-preserving fashion, but that a butterfly cannot emulate an
expander (of any size) in a work-preserving fashion; a proof that a mesh can be emulated in real time in
a work-preserving fashion on a butterfly, even though any $O(1)$-to-1 embedding of the mesh has dilation
$\Omega(\log N)$; and a proof that an N-node butterfly can emulate an $N \log N$-node shuffle-exchange graph in a
work-preserving fashion, and vice versa.

Cynthia  Phillips  finished  her  dissertation,  entitled  Theoretical  and  Experimental  Analyses  of  Parallel 
Combinatorial  Algorithms.  The  thesis  investigates  parallel  algorithms  for  graph  and  matrix  problems. 
Some of the algorithms are known, and some she has developed. She has analyzed them theoretically and
experimentally.  The  thesis  is  broken  into  five  parts. 

The first major contribution of her thesis shows how n-node, e-edge graphs can be contracted in a manner
similar to the parallel tree contraction algorithm due to Miller and Reif. She gives an
$O((n + e)/\lg n)$-processor deterministic algorithm that contracts a graph in $O(\lg^2 n)$ time in the EREW PRAM model.
She also gives an $O(n/\lg n)$-processor randomized algorithm that with high probability can contract a
bounded-degree graph in $O(\lg n + \lg^2 \gamma)$ time, where $\gamma$ is the maximum genus of any connected component
of the graph. (The algorithm can be made to run in deterministic $O(\lg n \lg^* n + \lg^2 \gamma)$ time using known
techniques.) This algorithm does not require a priori knowledge of the genus of the graph to be contracted.
The contraction algorithm for bounded-degree graphs can be used directly to solve the problem of region
labeling in vision systems, i.e., determining the connected components of bounded-degree planar graphs
in $O(\lg n)$ time, thus improving the best previous bound of $O(\lg^2 n)$.


The second part describes four APL-like primitives for manipulating dense matrices and vectors and
describes their implementation on the Connection Machine hypercube multiprocessor. These primitives
provide a natural way of specifying parallel matrix algorithms independently of machine size or architecture
and can actually enhance efficiency by facilitating automatic load balancing. The implementations
are efficient in the frequently occurring case where there are fewer processors than matrix elements. In
particular, if there are $m > p \lg p$ matrix elements, where p is the number of processors, then the
implementations of some of the primitives are asymptotically optimal for a weak hypercube in that the
processor-time product is no more than a constant factor higher than the running time of the best serial
algorithm. Furthermore, the parallel time required is optimal to within a constant factor. Her
implementation of the primitives on the Connection Machine 2 system improved the performance of a simplex
program for linear programming by almost an order of magnitude over a naive implementation, from 55
Mflops to 525 Mflops.

The third portion of her thesis investigates dimension-exchange load balancing, which is a generalization
of one of the techniques used in the hypercube implementation of the vector-matrix primitives. She shows
that when tasks are considered indivisible, after one pass of dimension-exchange load balancing, in the
worst case, some processor will have $\Theta(\lg n)$ tasks over the average. She also shows that there is an initial
distribution of tasks for which this load-balancing strategy requires an average of $O(\lg n)$ messages for
each unit reduction in the global maximum number of tasks.
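
A single pass of dimension-exchange balancing can be sketched as follows. This is an illustrative
simulation, not the implementation from the thesis: in each hypercube dimension, neighboring
processors split their combined load as evenly as indivisible tasks allow. The function name and the
rule that the odd task stays with the processor that held more are assumptions.

```python
def dimension_exchange_pass(loads):
    """One pass of dimension-exchange balancing on a hypercube.

    loads[i] is the number of indivisible tasks on processor i, and
    len(loads) must be a power of two. Giving the odd task to the
    processor that held more is an assumed tie-breaking rule.
    """
    n = len(loads)
    if n == 0 or n & (n - 1):
        raise ValueError("number of processors must be a power of two")
    loads = list(loads)
    bit = 1
    while bit < n:                      # one round per hypercube dimension
        for i in range(n):
            j = i ^ bit                 # neighbor across this dimension
            if i < j:
                total = loads[i] + loads[j]
                low, high = total // 2, total - total // 2
                if loads[i] >= loads[j]:
                    loads[i], loads[j] = high, low
                else:
                    loads[i], loads[j] = low, high
        bit <<= 1
    return loads
```

For example, starting from [8, 0, 0, 0, 0, 0, 0, 0] one pass leaves one task on every processor, while
[7, 0, 0, 0] ends as [2, 2, 2, 1]; rounding at each dimension is what lets the imbalance accumulate in
less favorable distributions.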

The fourth part of Phillips's thesis reports on preliminary experimental investigations which indicate
that massively parallel computers like the Connection Machine (CM) appear to be well suited for both
sparse and dense implementations of dual relaxation algorithms for network optimization. (Her parallel
implementation of a nonlinear network optimization program on the Connection Machine is the fastest
program to date for its class of problems.) Implementations of a dense version of a known algorithm
for the assignment problem and parallel versions of known heuristics for the traveling salesman problem
suffered from a "sequential tail" phenomenon. Tail-cutting heuristics with appropriate (case-sensitive)
parameters improved performance markedly.

The fifth and last contribution in her thesis is the design of a VLSI chip which pseudorandomly permutes
bit-serial messages by sending them through a Benes network whose switches have been pseudorandomly
set. Providing a pseudorandom permuter in a simple, high-throughput chip could improve the performance
of routing algorithms for multiprocessors.

Shlomo Kipnis investigated priority arbitration schemes that employ busses to arbitrate among n
modules in a digital system. He focused on distributed mechanisms that employ m busses, for
$\lg n \le m \le n$, and use asynchronous combinational arbitration logic. A widely used distributed asynchronous
mechanism is the binary arbitration scheme, which with $m = \lg n$ busses arbitrates in $t = \lg n$ units
of time. Shlomo Kipnis presented a new asynchronous scheme, binomial arbitration, that by using
$m = \lg n + 1$ busses reduces the arbitration time to $t = \frac{1}{2} \lg n$. Extending this result, he presented the
generalized binomial arbitration scheme that achieves a bus-time tradeoff of the form $m = \Theta(t n^{1/t})$ between
the number of arbitration busses m and the arbitration time t (in units of bus-settling delay), for values of
$1 \le t \le \lg n$ and $\lg n \le m \le n$. These schemes are based on a novel analysis of data-dependent delays and
generalize the two known schemes: linear arbitration, which with $m = n$ busses achieves $t = 1$ time, and
binary arbitration, which with $m = \lg n$ busses achieves $t = \lg n$ time. Most importantly, these schemes
can be adopted with no changes to existing hardware and protocols; they merely involve selecting a good
set of priority arbitration codewords. The binomial arbitration and the generalized binomial arbitration
schemes are the subject of a patent application.
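
The shape of the bus-time tradeoff can be seen by evaluating the expression inside the asymptotic
bound at its two endpoints. The sketch below is only a numerical illustration under the assumption
that the tradeoff has the form m = Theta(t n^(1/t)); the function name is hypothetical, and because the
bound hides constant factors the values indicate growth rates rather than exact bus counts.

```python
def tradeoff_expression(n, t):
    """Evaluate t * n**(1/t), the expression inside the Theta bound."""
    return t * n ** (1.0 / t)


n = 64                                   # lg n = 6
linear_end = tradeoff_expression(n, 1)   # t = 1: the expression equals n
binary_end = tradeoff_expression(n, 6)   # t = lg n: the expression equals 2 lg n
```

At t = 1 the expression equals n, matching linear arbitration, and at t = lg n it is within a constant
factor of lg n, matching binary arbitration.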

Bruce Maggs and Tom Leighton have been studying adaptive fault-tolerant algorithms for packet
routing. They have shown that an N-input multibutterfly can sustain k faults and still route $\log N$
permutations between some set of $N - O(k)$ inputs and $N - O(k)$ outputs in $O(\log N)$ time. The multibutterfly
is even more resilient to randomized faults. For example, with high probability, a specially modified twin
butterfly can tolerate $N^{3/4}$ faulty internal nodes, and still route any $\log N$ permutations of N packets in
$O(\log N)$ time. Thus, the multibutterfly is the first bounded-degree network known to be able to sustain
large numbers of faults with only minimal degradation in performance.

In the past year, Tom Cormen has continued to write the textbook Introduction to Algorithms with
Professors Leiserson and Rivest. The book will be published in early 1990.

Marios Papaefthymiou continued his research on synchronous circuit optimization under the supervision
of Prof. Leiserson. His work focused on investigating the underlying structure of the retiming operation.
The result of this effort was a concise closed-semiring description of retiming for unit-delay circuits. This
compact description suggests a promising point of view for looking at retiming. Marios Papaefthymiou
is currently trying to design efficient algorithms for optimum retiming by exploiting the group structure
that he revealed.

During the past six months, James K. Park has been collaborating with Alok Aggarwal and Dina
Kravets on a number of problems relating to totally monotone arrays. Such arrays arise naturally in a
wide variety of fields, including computational geometry, dynamic programming, and VLSI river routing.
Park's work with Aggarwal centers on the problem of finding maximum entries in totally monotone arrays
and applications of efficient sequential and parallel algorithms for this problem. Park's work with Kravets
investigates the problems of selection and sorting in the context of totally monotone arrays and applications
of efficient algorithms for these problems.

Alexander  Ishii  has  been  generalizing  his  VLSI  timing  analysis  algorithms.  A  key  concern  has  been  the 
need to accurately handle the "undefined" values that electrical signals must take on when they make a
transition  between  valid  logic  levels.  In  addition,  he  has  attempted  to  make  the  algorithms  easily  adaptable 
to  different  assumptions  about  the  circuit  being  analyzed. 

Prof. Leighton is continuing his research on networks and algorithms for parallel computation. Recently
he has focused on the following specific problems: the development of fast packet routing algorithms for
commonly used fixed-connection networks, the development of algorithms to reconfigure networks such as
the hypercube around faults, the development of dynamic on-line algorithms for embedding computational
structures such as trees in networks such as the hypercube in a way that balances computational load
and that minimizes the induced communication load on the network, the development of algorithms for
emulating one kind of network on another in a way that preserves the total amount of work (processors ×
time) that is done, and the development of a new network architecture for routing that can tolerate large
numbers of faults without a substantial degradation in performance. The particular advances that have
been made in each of these areas are briefly summarized in what follows.

In the area of packet routing, Prof. Leighton and his coauthors have discovered the first store-and-forward
routing algorithm which can route $n^2$ packets in $2n - 2$ steps on an $n \times n$ array with
constant-size queues at each node. The details of these and related results can be found in [16]. They have also
discovered new and more efficient routing algorithms for the multibutterfly. These algorithms are the first
that are highly tolerant of worst-case faults. Also in the area of fault tolerance, Prof. Leighton and his
coauthors have shown that a hypercube can tolerate a very large number (a constant fraction) of randomly
located faults without incurring more than a constant-factor loss in performance, no matter how large
the hypercube is. They have also discovered simple algorithms for routing around faults in the hypercube
that are guaranteed to perform nearly as well as the best routing algorithms when no faults are present.
The details of this work are described in [12].

In the area of network embeddings and scheduling, Prof. Leighton and his coauthors have discovered
optimal algorithms for embedding dynamically growing and shrinking trees in a hypercube so that the
processing load on the nodes of the hypercube is balanced, and so that all communication links are local.
This work has application to the problem of locally scheduling the work assigned to the processors of a
hypercube in a dynamic fashion (i.e., as one computation spawns another, the algorithm determines the
processor that will handle the new task). They have also discovered optimal algorithms for mapping code
written for one architecture onto a different architecture in a way that minimizes the total amount of work
required by the simulating machine. These results are described in [7, 25].

The past year was also a good one for Prof. Leighton's students. Bruce Maggs, Satish Rao, Richard
Koch, and Mark Newman all obtained their Ph.D.s this year. Together with Prof. Leighton, they made
lots of solid progress on packet routing algorithms, fault tolerance in networks, and on graph embedding
problems. At this point they are getting close to asymptotically optimal results that also appear to work
well in reality. In fact, the highlight of the coming year will be to help design and lay out a multibutterfly
network for Tom Knight's new machine. With a little luck, theory will be able to play an important
role in the development of a state-of-the-art machine. Prof. Leighton is also working with Bill Dally and
his students to see if theory can be helpful with the routing protocols on his new machine, and he has
been talking with Alan Baratz about the possibilities of implementing some of the new theory routing
algorithms on the IBM GF11 so that it can become a general-purpose routing machine.


Another highlight of the last six months was the new ACM Symposium on Parallel Algorithms and
Architectures that Prof. Leighton helped to organize. The first meeting was in Santa Fe in mid-June,
and the meeting was very successful. Papers that were presented ranged from theory to practice, and the
meeting provided a good forum for interaction between people who think about parallel machines, those
who build them, and those who use them.

7  Applications 

Over the past six months, efforts in developing numerical algorithms for problems related to the design
of an ARC, as well as those that can effectively exploit the ARC's capability, have continued under
the direction of Jacob White. Interesting new algorithms have been unearthed in the areas of parallel
circuit simulation and Monte Carlo device simulation. In addition, preliminary experiments with recently
developed algorithms in capacitance extraction and classical semiconductor device simulation have been
completed with very encouraging results.

In the area of circuit simulation, we have completed the development of SIMLAB [69, 70], a fast,
general-purpose circuit simulation program intended for use in circuit simulation research. The program
is presently being used for our course in numerical simulation as well as forming the basis for three
ongoing research projects. SIMLAB is being used to study multiple timepoint methods for increasing the
parallelism in circuit simulation so as to effectively exploit a massively parallel processor on reasonably
sized problems. In addition, SIMLAB is being used to study multigrid variations for efficient simulation
of the analog arrays, like those used in early vision.

SIMLAB  has  also  been  used  to  study  the  behavior  of  the  switched  linear  resistive  and  nonlinear 
resistive  networks  used  for  image  smoothing  and  segmentation  algorithms  (under  the  supervision  of  Prof. 
J. Wyatt). Arc-length style continuation methods were added to SIMLAB so that comparison studies of
several  continuation  methods  can  be  gracefully  implemented  in  analog  VLSI. 

Also in the area of circuit simulation, we have undertaken a study of exponential-fitting numerical
integration algorithms. We have been able to prove several strong results indicating that the performance of
recently published exponential-fitting algorithms is, in the limit of large timesteps, identical to that of other
well-known techniques. Detailed experiments indicate exponential-fitting offers little advantage. We have also
examined several modifications which seem to improve the accuracy of the exponential-fitting algorithm,
but it is unlikely to produce results that are competitive with more standard techniques.

In the area of classical device simulation, we have completed preliminary experiments using waveform
relaxation to perform transient two-dimensional simulation of MOS devices. Experiments demonstrate
that WR converges in a uniform manner, and that there is typically some multirate behavior in a device
that the WR algorithm can exploit. Speed and accuracy comparisons between standard direct methods,
red/black Gauss-Seidel WR, and red/black overrelaxed WR indicate that for the experiments examined,
calculated terminal currents match well between the methods, and that overrelaxed WR was between 2 and
5 times faster than direct methods. A recently implemented modification based on a waveform-Newton
algorithm increased this to a factor of from 5 to 11 [9, 10].
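
The basic idea of waveform relaxation can be shown on a toy problem. The following is a minimal sketch
of Gauss-Seidel WR, not the device simulator described above: it treats a two-variable linear system
x' = Ax discretized by backward Euler and compares the relaxed waveforms against a direct solution.
The matrix, step size, and function names are assumptions chosen for illustration; red/black ordering,
overrelaxation, and the waveform-Newton modification are not modeled.

```python
def direct_backward_euler(a, x0, h, steps):
    """Backward Euler on x' = A x (2x2), solving both unknowns together."""
    m00, m01 = 1 - h * a[0][0], -h * a[0][1]
    m10, m11 = -h * a[1][0], 1 - h * a[1][1]
    det = m00 * m11 - m01 * m10
    xs = [tuple(x0)]
    for _ in range(steps):
        b0, b1 = xs[-1]
        # Solve (I - hA) x_next = x_prev by Cramer's rule.
        xs.append(((m11 * b0 - m01 * b1) / det, (m00 * b1 - m10 * b0) / det))
    return xs


def gauss_seidel_wr(a, x0, h, steps, sweeps):
    """Gauss-Seidel waveform relaxation for the same discretized system.

    Each sweep integrates one variable over the whole time window while
    the other variable's waveform is held at its most recent iterate.
    """
    # Initial guess: both waveforms constant at their initial values.
    w0 = [x0[0]] * (steps + 1)
    w1 = [x0[1]] * (steps + 1)
    for _ in range(sweeps):
        for n in range(steps):  # update variable 0 using the old w1
            w0[n + 1] = (w0[n] + h * a[0][1] * w1[n + 1]) / (1 - h * a[0][0])
        for n in range(steps):  # update variable 1 using the new w0
            w1[n + 1] = (w1[n] + h * a[1][0] * w0[n + 1]) / (1 - h * a[1][1])
    return list(zip(w0, w1))


A = [[-2.0, 1.0], [1.0, -2.0]]
reference = direct_backward_euler(A, (1.0, 0.0), 0.1, 50)
relaxed = gauss_seidel_wr(A, (1.0, 0.0), 0.1, 50, 40)
max_error = max(abs(r[i] - w[i]) for r, w in zip(reference, relaxed) for i in (0, 1))
```

Because each variable is integrated separately over the whole window, the variables could in
principle use different timesteps, which is the multirate behavior WR is able to exploit.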

Our other project in classical device simulation is in developing efficient and robust numerical
algorithms for a two-dimensional semiconductor device simulator that includes both momentum and energy
balance equations. Tracking the electron energies allows for a more accurate characterization of both hot
electron effects and substrate currents. The program developed uses a full Newton method to compute
potentials, electron concentrations, and electron temperatures on a grid that describes the device.
Initial simulation results on a MOSFET were close to what was expected theoretically and what had been
published in the literature by other researchers. Because of the reliability of the algorithms used in this
program, we expect to be able to examine the effects of a wider range of physical models for mobility and
impact ionization.

Simulation of small geometry devices by particle simulation or Monte-Carlo techniques is becoming
increasingly popular, even though the method is computationally much more expensive than numerically
solving the standard or modified drift-diffusion equations. We are presently investigating alternative
numerical techniques to see if it is possible to make Monte-Carlo simulation less computationally expensive
and more parallelizable. In particular, we are investigating the interaction between the particle motions
and the changes in the electric fields.

Three-dimensional capacitance and inductance extraction have recently become important because the
dense packing of processors and the memory required for high-performance parallel computers require
three-dimensional interconnection. To ensure an interconnect design will be capable of achieving desired
performance, coupling capacitance and inductance must be examined. Over the past year we developed
a capacitance extraction algorithm for arbitrary geometries of ideal conductors in a uniform dielectric.
The algorithm reduces the calculation complexity from order $n^3$, for the standard algorithm, to order
$n$, where n is the number of tiles into which the conductor surfaces are discretized. The algorithm uses
a combination of an iterative technique and a multipole expansion algorithm. The initial stages in the
implementation and testing of a fast multipole-accelerated conjugate gradient algorithm for extraction of
capacitances from complex three-dimensional geometries are complete, and the method provides nearly
an order of magnitude speed improvement over the standard approach with as few as eight conductors [.xj.

Abstract


Express  cubes  are  k-ary  n-cube  interconnection  networks  augmented  by  express  channels  that 
provide  a  short  path  for  non-local  messages.  An  express  cube  combines  the  logarithmic  diameter 
of  an  indirect  network  with  the  wire-efficiency  and  ability  to  exploit  locality  of  a  direct  network. 
The  insertion  of  express  channels  reduces  the  network  diameter  and  thus  the  distance  component 
of network latency. Wire length is increased, allowing networks to operate with latencies that
approach the physical speed-of-light limitation rather than being limited by node delays. Express
channels increase wire bisection in a manner that allows the bisection to be controlled independent
of the choice of radix, dimension, and channel width. By increasing wire bisection to saturate
the available wiring media, throughput can be substantially increased. With an express cube
both  latency  and  throughput  are  wire-limited  and  within  a  small  factor  of  the  physical  limit 
on  performance.  Express  channels  may  be  inserted  into  existing  interconnection  networks  using 
interchanges.  No  changes  to  the  local  communication  controllers  are  required. 

1  Introduction 

Interconnection networks are used to pass messages containing data and synchronization information
between the nodes of concurrent computers [1] [2] [16] [17]. The messages may be sent
between  the  processing  nodes  of  a  message-passing  multicomputer  [1]  or  between  the  processors 
and  memories  of  a  shared-memory  multiprocessor  [2]. 

An  interconnection  network  is  characterized  by  its  topology,  routing,  and  flow  control  [10].  The 
topology of a network is the arrangement of its nodes and channels into a graph. Routing determines
the path chosen by a message in this graph. Flow control deals with the allocation of
channel  and  buffer  resources  to  a  message  as  it  travels  along  this  path.  This  paper  deals  only 
with  topology.  Express  cubes  can  be  applied  independent  of  routing  and  flow  control  strategies. 

The performance of a network is measured in terms of its latency and its throughput. The latency
of a message is the elapsed time from when the message send is initiated until the message is
completely received. Network latency is the average message latency under specified conditions.
Network throughput is the number of messages the network can deliver per unit time.

The research described in this paper was supported in part by the Defense Advanced Research Projects Agency
under contracts N00014-88K-0738 and N00014-87K-0825 and in part by a National Science Foundation Presidential
Young Investigator Award with matching funds from General Electric Corporation and IBM Corporation.

Low-dimensional k-ary n-cube networks using wormhole routing have been shown to provide low
latency and high throughput for networks that are wire-limited [4] [5] [9]. For $n \le 3$, the k-ary
n-cube topology is wire-efficient in that it makes efficient use of the available bisection width. This
topology maps into the three physical dimensions in a manner that allows messages to use all of the
available bandwidth along their path without ever having to double back on themselves. Also,
low-dimensional k-ary n-cubes concentrate bandwidth into a few wide channels so that the component
of latency due to message length is reduced. In most contemporary concurrent computers, this is
the dominant component of latency. Because of their low latency, high throughput, and affinity for
implementation in VLSI, these k-ary n-cube networks with n = 2 or 3 have been used successfully
in the design of several concurrent computers including the Ametek 2010 [17], the J-Machine [7]
[8], and the Mosaic [18].

However, low-dimensional k-ary n-cube interconnection networks have two significant
shortcomings:

•  Because wires are short, node delays dominate wire delays and the distance-related component
of latency falls more than an order of magnitude short of speed-of-light limitations. In
the J-Machine [7], for example, node delay is 50ns while the longest wire is 225mm and has
a time-of-flight delay of 1.5ns.

•  The  channel  width  of  these  networks  is  often  limited  by  node  pin  count  rather  than  by 
wire  bisection.  For  example,  the  J-Machine  channel  width  is  limited  to  9-bits  by  pin  count 
limitations.  In  the  physical  node  width  of  50mm,  a  6-layer  printed  circuit  board  can  handle 
over  four  times  this  channel  width  after  accounting  for  through  holes  and  local  connections. 

In short, many regular k-ary n-cube interconnection networks are node-limited rather than
wire-limited. In these networks, node delay and pin limitations dominate wire delay and wire density
limitations. The ratios of node delay to wire delay and pin density to wire density cannot be
balanced in a regular k-ary n-cube.

Express cubes overcome this problem by allowing wire length and wire density to be adjusted
independently of the choice of radix, k, dimension, n, and channel width, W. An express cube
is a k-ary n-cube augmented by one or more levels of express channels that allow non-local
messages to bypass nodes. The wire length of the express channels can be increased to the
point that wire delays dominate node delays. The number of express channels can be adjusted to
increase throughput until the available wiring media is saturated. This ability to balance node and
wire limitations is achieved without sacrificing the wire-efficiency of k-ary n-cube networks. The
number of channels traversed by a message in a hierarchical express cube grows logarithmically
with distance as in a multistage interconnection network [11][19]. The express cube, however, is
able to exploit locality while in a multistage network all messages must traverse the diameter of
the network. With an express cube, both latency and throughput are wire limited and are within
a small constant factor of the physical limit on performance.

The  remainder  of  this  paper  describes  the  express  cube  topology  and  analyzes  its  performance. 
Section 2 summarizes the notation that will be used throughout the paper. Section 3 introduces
the  express  cube  topology  in  steps.  Basic  express  cubes  (Section  3.1)  reduce  latency  to  twice  the 
delay  of  a  dedicated  wire  for  messages  traveling  long  distances.  Throughput  can  be  increased  to 
saturate  the  available  wiring  density  by  adding  multiple  express  channels  (Section  3.2).  With  a 
hierarchical  express  cube  (Section  3.3),  latency  for  short  distances,  while  node-limited,  is  within 
a  small  constant  factor  of  the  best  that  can  be  achieved  by  any  bounded  degree  network.  Some 
design  considerations  for  express  cube  interchanges  are  discussed  in  Section  4. 


2  Notation 


The  following  symbols  are  used  in  this  paper.  They  are  listed  here  for  reference. 

C, the set of channels in the network.

D, Manhattan distance traveled by a message, $|x_s - x_d| + |y_s - y_d| + |z_s - z_d|$, where
the source is at $(x_s, y_s, z_s)$ and the destination is at $(x_d, y_d, z_d)$.

H, hops, the number of nodes traversed by a message.

i, number of nodes between interchanges in an express cube.

k, the radix of the network - the length in each dimension.

l, the number of levels of hierarchy in a hierarchical express cube.

L, the message length in bits.

n, the dimension of the network.

N, the set of nodes in the network. Where it is unambiguous, N is also used for the
number of nodes in the network, |N|.

$T_n$, the latency of a node.

$T_w$, the latency of a wire that connects two physically adjacent nodes.

$T_p$, the pipeline period of a node.

W, the width of a channel in bits.

α, the ratio of node latency to wire latency, $T_n/T_w$.

Communication  between  nodes  is  performed  by  sending  messages.  A  message  may  be  broken 
into  one  or  more  packets  for  transmission.  A  packet  is  the  smallest  unit  that  contains  routing 
and  sequencing  information.  Packets  contain  one  or  more  flow  control  digits  or  flits.  A  flit  is 
the  smallest  unit  on  which  flow  control  is  performed.  A  flit  in  turn  is  composed  of  one  or  more 
physical  transfer  units  or  phits2.  A  phit  is  W-bits,  the  size  of  the  physical  communication  media. 
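The relation between message length and transfer units can be illustrated with a little arithmetic. The following is a minimal sketch, assuming header and sequencing overhead is ignored; the function names and the 18-bit flit size are hypothetical and chosen only for illustration.

```python
def units_needed(total_bits: int, unit_bits: int) -> int:
    """Number of fixed-size units needed to carry total_bits (ceiling division)."""
    if total_bits <= 0 or unit_bits <= 0:
        raise ValueError("sizes must be positive")
    return -(-total_bits // unit_bits)


def phits_per_message(message_bits: int, channel_width_bits: int) -> int:
    """A phit is W bits, so an L-bit message needs ceil(L / W) phits."""
    return units_needed(message_bits, channel_width_bits)


def flits_per_message(message_bits: int, flit_bits: int) -> int:
    """Flits needed for the same message; flit_bits is an assumed parameter."""
    return units_needed(message_bits, flit_bits)


# Example with a 9-bit channel (the J-Machine width quoted in the text)
# and a hypothetical 18-bit flit made of two phits.
print(phits_per_message(150, 9), flits_per_message(150, 18))
```

The phit count is the quantity that multiplies the pipeline period in the latency equations of Section 3.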

An interconnection network consists of a set of nodes, N, that are connected by a set of channels,
$C \subseteq N \times N$. Each channel is unidirectional and carries data from a source node to a destination
node. For the purposes of this paper it is assumed that the network is bidirectional: channels
occur in pairs so that $(n_i, n_j) \in C \Rightarrow (n_j, n_i) \in C$.
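The definition above can be made concrete with a small sketch that builds the node set and channel set of a k-ary n-dimensional array and checks the bidirectional property. It assumes an array without wraparound connections (as in the linear array of Figure 1A); the function names are hypothetical.

```python
from itertools import product


def mesh_channels(k, n):
    """Return (nodes, channels) for a k-ary n-dimensional array without wraparound.

    Nodes are coordinate tuples; each channel is an ordered (source, destination) pair.
    """
    nodes = set(product(range(k), repeat=n))
    channels = set()
    for node in nodes:
        for dim in range(n):
            if node[dim] + 1 < k:
                neighbor = node[:dim] + (node[dim] + 1,) + node[dim + 1:]
                # Channels are unidirectional, so add one in each direction.
                channels.add((node, neighbor))
                channels.add((neighbor, node))
    return nodes, channels


def is_bidirectional(channels):
    """True if every channel (a, b) has a matching reverse channel (b, a)."""
    return all((b, a) in channels for (a, b) in channels)


nodes, channels = mesh_channels(4, 2)
print(len(nodes), len(channels), is_bidirectional(channels))
```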

2There is no constraint that the physical unit of transfer, phit, must be smaller than the flow control unit, flit.
It is possible to construct systems with several flits in each phit.
Figure 1: Insertion of express channels reduces latency: (A) A regular k-ary 1-cube network may
be dominated by node delay, (B) A k-ary 1-cube with express channels reduces the node delay
component of latency.


3  Express  Cubes 

3.1  Express  Channels  Reduce  Latency 

Figure 1 illustrates the application of express channels to a k-ary 1-cube or linear array. A regular
k-ary 1-cube is shown in Figure 1A. The network is a linear array of k processing nodes, labeled N,
each connected to its nearest neighbors by channels of width W. The delay of a phit propagating
through a node is $T_n$. The delay of the wire connecting two nodes is $T_w$. Each channel can accept
a new phit every $T_p$. The latency of a message of length L sent distance D is

$$T_a = H T_n + D T_w + \frac{L}{W} T_p = (T_n + T_w) D + \frac{L}{W} T_p \qquad (1)$$

Message latency is composed of three components as shown in equation (1). The first component
is the node latency, due to the number of hops, H. The second component is the wire latency, due
to the distance D. The third component is due to message length, L. For a conventional k-ary
n-cube, H = D and since for most networks $T_n \gg T_w$, the node latency dominates the wire
latency. Express cubes reduce the node latency by increasing wire length to reduce the number
of hops, H.
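As a rough numerical illustration of equation (1), the sketch below evaluates the three latency components for a conventional k-ary n-cube, where H = D. The function name and the sample parameter values are illustrative assumptions, not figures taken from the text.

```python
def regular_cube_latency(D, L, W, Tn, Tw, Tp):
    """Return (node, wire, length, total) latency per equation (1), with H = D."""
    H = D  # conventional k-ary n-cube: one node traversed per unit of distance
    node = H * Tn            # node latency, due to the number of hops
    wire = D * Tw            # wire latency, due to distance
    length = (L / W) * Tp    # latency due to message length
    return node, wire, length, node + wire + length


# Hypothetical values in ns: node delay 50, adjacent-wire delay 2, pipeline period 50.
node, wire, length, total = regular_cube_latency(D=10, L=128, W=8, Tn=50, Tw=2, Tp=50)
print(node, wire, length, total)
```

With values like these the wire term is a small fraction of the total, which is the node-limited situation that express channels are meant to address.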

An express k-ary 1-cube is shown in Figure 1B. Express channels have been added to the array
by  inserting  an  interchange,  labeled  I,  every  i  nodes.  An  interchange  is  not  a  processing  node. 
It  performs  only  communication  functions  and  is  not  assigned  an  address.  Each  interchange 
is  connected  to  its  neighboring  interchanges  by  an  additional  channel  of  width  W,  the  express 
channel.  When  a  message  arrives  at  an  interchange  it  is  routed  directly  to  the  next  interchange  if 
it  is  not  destined  for  one  of  the  intervening  nodes.  To  preserve  the  wire-efficiency  of  the  network, 
messages  are  never  routed  past  their  destinations  on  the  express  channels  even  though  doing  so 
would  reduce  H  in  many  cases. 

The delay, $T_n$, and throughput, $1/T_p$, of an interchange are assumed to be identical to those of
a node. The wire delay of the express channel is assumed to be $iT_w$. To simplify the following
analysis, it is assumed that interchanges add no physical distance to the network. Assuming $i \mid D$,
$H = D/i + i$ and insertion of express channels reduces the latency to

$$T_b = \left(\frac{D}{i} + i\right) T_n + T_w D + \frac{L}{W} T_p \qquad (2)$$

In the general case, an average message traversing D processing nodes travels over $H_i = (i + 1)/2$
local channels to reach an interchange, $H_e = \lfloor D/i - 1/2 + 1/(2i) \rfloor$ express channels to reach the
last interchange before the destination, and finally $H_f = (D - i/2 + 1/2) \bmod i$ local channels to
the destination. The total number of hops is $H = H_i + H_e + H_f$ giving a latency of

$$T_b = \left(\frac{i + 1}{2} + \left\lfloor \frac{D}{i} - \frac{1}{2} + \frac{1}{2i} \right\rfloor + \left(D - \frac{i}{2} + \frac{1}{2}\right) \bmod i\right) T_n + D T_w + \frac{L}{W} T_p \qquad (3)$$
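The hop counts in the general case can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the guard `D >= i` restricts it to messages that actually reach an interchange. When i divides D, the three components should sum to the D/i + i used in equation (2).

```python
from math import floor


def express_hops(D, i):
    """Average hop counts for a flat express cube.

    Returns (initial local hops, express hops, final local hops).
    """
    if i < 1 or D < i:
        raise ValueError("this sketch covers only non-local messages with D >= i >= 1")
    h_initial = (i + 1) / 2                      # local channels to reach an interchange
    h_express = floor(D / i - 0.5 + 1 / (2 * i))  # express channels traversed
    h_final = (D - i / 2 + 0.5) % i              # local channels to the destination
    return h_initial, h_express, h_final


def express_latency(D, i, L, W, Tn, Tw, Tp):
    """Latency with express channels: total hops times Tn, plus wire and length terms."""
    H = sum(express_hops(D, i))
    return H * Tn + D * Tw + (L / W) * Tp


# For D = 16 and i = 4 the hop total equals D/i + i.
print(sum(express_hops(16, 4)), 16 / 4 + 4)
```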


For large distances, $D \gg \alpha = T_n/T_w$, choosing $i = \alpha$ balances the node and wire delay. With
this choice of i, the latency due to distance is approximately twice the wire latency, $T_D \approx 2T_wD$.
The latency for large distances of a large express channel network with $i = \alpha$ is within a factor of
two of the latency of a dedicated Manhattan wire between the source and destination3.

For small distances or large α, the i term in the coefficient of $T_n$ in equation (2) is significant and
node delay dominates. For such networks, latency is minimized by choosing $i = \sqrt{D}$ resulting in
$T_D \approx 2(\sqrt{D} - 1)T_n$. The use of hierarchical express channels (Section 3.3) can further improve
the latency for small distances.
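The two choices of interchange spacing can be explored numerically. The sketch below is an illustration only: it uses the distance-dependent part of equation (2) with latencies expressed in units of $T_w$, the function names are hypothetical, and the sample distance is deliberately large enough that the i term in the node coefficient is negligible.

```python
def distance_latency(D, i, alpha):
    """Distance-dependent part of equation (2) in units of Tw, with Tn = alpha * Tw."""
    return (D / i + i) * alpha + D


def best_spacing(D):
    """Divisor i of D that minimizes the node-hop coefficient D/i + i of equation (2)."""
    divisors = [i for i in range(1, D + 1) if D % i == 0]
    return min(divisors, key=lambda i: D / i + i)


alpha = 64
D_long = 2 ** 20
# Ratio of express-cube distance latency to a dedicated wire of the same length.
print(distance_latency(D_long, alpha, alpha) / D_long)
# Spacing that minimizes the hop count for a short distance.
print(best_spacing(100))
```

For the long distance the ratio comes out close to two, and for D = 100 the minimizing spacing is the square root of the distance.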


3.2  Multiple  Express  Channels  Increase  Throughput  to  Saturate  Wire  Density 

To  first  order,  network  throughput  is  proportional  to  wire  bisection  and  hence  wire  density.  If  more 
wires  are  available  to  transmit  data  across  the  network,  throughput  will  be  increased  provided 
that  routing  and  flow  control  strategies  are  able  to  profitably  schedule  traffic  onto  these  wires. 
Many regular network topologies, such as low-dimensional k-ary n-cubes, are unable to make use
of  all  available  wire  density  because  of  pin  limitations.  The  wire  bisection  of  an  express  cube  can 
be  controlled  independent  of  the  choice  of  radix,  k,  dimension,  n,  or  channel  width,  W  by  adding 
multiple  express  channels  to  the  network  to  match  network  throughput  with  the  available  wiring 
density. 

Figure  2  shows  two  methods  of  inserting  multiple  express  channels.  Multiple  express  channels 
may  be  handled  by  each  interchange  as  shown  in  Figure  2A.  Alternatively,  simplex  interchanges 
can  be  interleaved  as  shown  in  Figure  2B. 

In method A, using multiple channel interchanges, an interchange is inserted every i nodes as above
and each interchange is connected to its neighbors using m parallel express channels. Figure 2A
shows a network with i = 4 and m = 2. The interchange acts as a concentrator combining
messages arriving on the m incoming express channels with non-local messages arriving on the
local channel and concentrating these message streams onto the m outgoing express channels.
This method has the advantage of making better use of the express channels since any message
can route on any express channel. Flexibility in express channel assignment is achieved at the
expense of higher pincount and limited expansion.

3There is nothing special about the factor of two. By choosing $i = j\alpha$, the distance component of latency will
be $(1 + 1/j)$ times the latency of a Manhattan wire.

Figure 2: Multiple express channels allow wire density to be increased to saturate the available
wiring media. Express channels can be added using either (A) interchanges with multiple express
channels, or (B) interleaved simplex interchanges.

With  method  B,  interleaving  simplex  interchanges,  m  simplex  interchanges  are  inserted  into  each 
group  of  i  nodes.  Each  interchange  is  connected  to  the  corresponding  interchange  in  the  next  group 
by  a  single  express  channel.  All  messages  from  the  nodes  immediately  before  an  interchange  will  be 
routed  on  that  interchange’s  express  channels.  Because  load  cannot  be  shared  among  interleaved 
express  channels,  an  uneven  distribution  of  traffic  may  result  in  some  channels  being  saturated 
while  parallel  channels  are  idle.  Method  B  has  the  advantage  of  using  simple  interchanges  and 
allowing  arbitrary  expansion.  In  the  extreme  case  of  inserting  an  interchange  between  every  pair 
of  nodes  the  resulting  topology  is  almost  the  same  as  the  topology  that  would  result  from  doubling 
the  number  of  dimensions. 

Both  of  the  methods  illustrated  in  Figure  2  have  the  effect  of  increasing  the  wire  density  (and 
bisection) by a factor of m + 1. To first order, network throughput will increase by a similar
amount.  There  will  be  some  degradation  due  to  uneven  loading  of  parallel  channels. 

The  use  of  multiple  express  channels  offsets  the  load  imbalance  between  express  and  local  channels. 
If  traffic  is  uniformly  distributed,  the  average  fraction  of  messages  crossing  a  point  in  the  network 
on a local channel is $P_l = 2i/k$ as compared to $P_e = (k - 2i)/k$ crossing on an express channel.
For large networks where $k \gg i$, the bulk of the traffic is on express channels. Increasing the
number  of  express  channels  applies  more  of  the  network  bandwidth  where  it  is  most  needed. 
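The traffic split quoted above is easy to tabulate. The sketch below simply evaluates the two fractions for uniformly distributed traffic; the function name and the sample values of k and i are illustrative assumptions.

```python
def traffic_fractions(k, i):
    """Fractions of messages crossing a point on a local and on an express channel.

    Uses the expressions from the text: 2i/k (local) and (k - 2i)/k (express).
    """
    if i < 1 or k < 2 * i:
        raise ValueError("expected k >= 2i and i >= 1")
    p_local = 2 * i / k
    p_express = (k - 2 * i) / k
    return p_local, p_express


for k in (16, 64, 256):
    print(k, traffic_fractions(k, 4))
```

As k grows relative to i, the express fraction approaches one, which is the motivation for adding parallel express channels.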

Multiple express channels are an effective method of increasing throughput in networks where the
channel width is limited by pinout constraints. For example, in the J-Machine the channel width,
W = 9, is set by pin limitations4. The printed-circuit board technology is capable of running 80
wires in each dimension across the 50mm width of a node. Even with many of these wires used
for local connections, four parallel 15-bit (data+control) wide channels can be easily run across
each node. A multiple express channel network with m = 3 could use this available wire density
to quadruple the throughput of the network.

4Each J-Machine node is packaged in a 168-pin PGA. The six communication channels each require 9 data bits
and 6 control bits consuming 90 of these pins. Power connections use 48 pins. The remaining 30 pins are used by
external memory interface and control.

Figure 3: Hierarchical express channels reduce latency due to local routing.

3.3 Hierarchical Express Cubes Have Logarithmic Node Delay

With a single level of express channels, an average of i local channels are traversed by each
non-local message. The node delay on these local channels represents a significant component
of latency and causes networks with short distances, $D < \alpha^2$, to be node limited. Hierarchical
express cubes overcome this limitation by using several levels of express channels to make node
delay increase logarithmically with distance for short distances.

The use of hierarchical express channels, shown in Figure 3, reduces the latency due to node
delay on local channels. With hierarchical express channels, there are l levels of interchanges. A
first-level interchange is inserted every i nodes. A second-level interchange replaces every ith
first-level interchange, every $i^2$ nodes. In general, a jth-level interchange replaces every ith
(j - 1)st-level interchange, every $i^j$ nodes5. Figure 3 illustrates a hierarchical express cube with i = 2, l = 2.

5This construction yields a fixed-radix express cube, with radix i for each level. It is also possible to construct
mixed-radix express cubes where the radix varies from level to level.

A jth-level interchange has j + 1 inputs and j + 1 outputs. Arriving messages are treated identically
regardless of the input on which they arrive. Messages that are destined for one of the next i
nodes are routed to the local (0th) output. Those remaining messages that are destined for one
of the next $i^2$ nodes are routed to the 1st output. The process continues with all messages with
a destination between $i^p$ and $i^{p+1}$ nodes away, $0 \le p \le j - 1$, routed to the pth output. All
remaining messages are routed to the jth output.

Figure 4: Hierarchical interchanges: (A) a third-level interchange, (B) a third-level interchange
implemented from first-level interchanges. (C,D) With a small performance penalty, ascending
and/or descending interchanges can be eliminated.

A message in a hierarchical express cube is delivered in three phases: ascent, cruise, and descent.
In the ascent phase, an average message travels (i + 1)/2 hops to get to the first interchange,
and (i - 1)/2 hops at each level for a total of $H_a = (i - 1)l/2 + 1$ hops and a distance of
$D_a = (i^l - 1)/2$. During the cruise phase, a message travels $H_c = \lfloor (D - D_a)/i^l \rfloor$ hops on level
l channels for a distance of $D_c = i^l H_c$. Finally, the message descends back through the levels
routing on each level, j, as long as the remaining distance is greater than $i^j$. For the special case
where $i^l \mid D$, the descending message takes $H_d = (i - 1)l/2 + 1$ hops for a distance of $D_d = (i^l + 1)/2$.
This gives a latency of

$$T_c = \left(\frac{D}{i^l} + (i - 1)l + 1\right) T_n + T_w D + \frac{L}{W} T_p \qquad (4)$$

Choosing i and l so that $i^l = \alpha$ balances node and wire delay for large distances. With this choice,
the delay due to local nodes is $(i - 1)lT_n = (i - 1)\log_i \alpha\, T_n$ which is a minimum for i = e. While 3
is the closest integer to e, a choice of i = 4 is preferred to facilitate decoding of binary addresses
in interchanges, and networks with i = 8 or i = 16 may be desirable under some circumstances.
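The special case covered by equation (4) can be evaluated with a short sketch. It sums the ascent, cruise, and descent hop counts given in the text and applies the same wire and message-length terms as equation (1). The function names and the sample parameter values are assumptions made for illustration, and only the case where $i^l$ divides D is handled.

```python
from math import floor


def hierarchical_hops(D, i, l):
    """Average (ascent, cruise, descent) hop counts when i**l divides D."""
    block = i ** l
    if D <= 0 or D % block != 0:
        raise ValueError("this sketch covers only the special case where i**l divides D")
    h_ascent = (i - 1) * l / 2 + 1          # hops while climbing through the levels
    d_ascent = (block - 1) / 2              # distance covered during the ascent
    h_cruise = floor((D - d_ascent) / block)  # hops on top-level channels
    h_descent = (i - 1) * l / 2 + 1         # hops while descending
    return h_ascent, h_cruise, h_descent


def hierarchical_latency(D, i, l, L, W, Tn, Tw, Tp):
    """Latency of a hierarchical express cube: hops * Tn + D * Tw + (L / W) * Tp."""
    hops = sum(hierarchical_hops(D, i, l))
    return hops * Tn + D * Tw + (L / W) * Tp


# i = 4, l = 3 gives i**l = 64; D = 128 is two top-level blocks away.
print(sum(hierarchical_hops(128, 4, 3)), 128 // 64 + (4 - 1) * 3 + 1)
```

The printed pair compares the phase-by-phase hop total with the closed-form coefficient of $T_n$ in equation (4).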

In the general case, $i^l \nmid D$, the latency of a hierarchical express cube is calculated by representing
the source and destination coordinates as $h = \log_i k$-digit radix-i numbers, $S = s_{h-1} \cdots s_0$ and
$D = d_{h-1} \cdots d_0$. WLOG we assume that S < D. During the ascent phase, a message routes
from S to $s_{h-1} \cdots s_{l+1} 0 \cdots 0$ taking $H_a = \sum_{j=0}^{l-1} ((i - s_j) \bmod i)$ hops for a distance of
$D_a = \sum_{j=0}^{l-1} ((i - s_j) \bmod i)\, i^j$. The cruise phase takes the message
$H_c = \sum_{j=l}^{h-1} (d_j - s_j)\, i^{j-l}$ hops for a
distance of $D_c = i^l H_c$. Finally, the descent phase takes the message from $d_{h-1} \cdots d_l 0 \cdots 0$ to D
taking $H_d = \sum_{j=0}^{l-1} d_j$ hops for a distance of $D_d = \sum_{j=0}^{l-1} d_j i^j$. For short distances the cruise phase
will never be reached. The message will move from ascent to descent as soon as it reaches a node
where all non-zero coordinates agree with D. The total latency for the general case is plotted as
a function of distance in Figure 5.

Figure 4 shows how hierarchical interchanges can be implemented using pin-bounded modules. A
level-j interchange requires j + 1 inputs and outputs if implemented as a single module as shown
for a third level interchange in Figure 4A. A level-j interchange can be decomposed into 2j - 1
level-one interchanges as shown for j = 3 in Figure 4B. A series of j - 1 ascending interchanges
that route non-local traffic toward higher levels is followed by a top-level interchange and a series
of j - 1 descending interchanges that allow local traffic to descend. With some degradation in
performance, the ascending interchanges can be eliminated as shown in Figure 4C. This change
requires extra hops in some cases as a message cannot skip levels on its way up to a high-level
express channel. Each message must traverse at least one level j - 1 channel before being switched
to a level-j channel. By restricting messages to also travel on at least one channel at each level
as they descend, the descending interchanges can be eliminated as well, leaving only the single
top-level interchange as shown in Figure 4D.

Figure 5: Latency as a function of distance for a hierarchical express channel cube with i = 4,
l = 3, α = 64, and a flat express channel cube with i = 16, α = 64. In a hierarchical express
channel cube latency is logarithmic for short distances and linear for long distances. The crossover
occurs between $D = \alpha$ and $D = i\alpha \log_i \alpha$. The flat cube has linear delay dominated by $T_n$ for
short distances and by $T_w$ for long distances.


3.4  Performance  Comparison 

Figure 5 shows how latency varies with distance in hierarchical and flat express cubes and compares
these latencies with the latency of a conventional k-ary 1-cube and of a direct wire. These
curves assume that the message source is midway between two interchanges. The latencies are
normalized to units of the wire delay between adjacent nodes. The latency of a conventional k-ary
1-cube is linear with slope α while the latency of a wire is linear with slope 1.



Figure 6: A multidimensional express cube may be constructed either by (A) inserting interchanges
into each dimension separately, or (B) interleaving multi-dimensional interchanges into
the array.


For short distances, until the first express channel is reached, a flat (non-hierarchical) express cube
has the same delay as a conventional k-ary n-cube, $T_D = \alpha D$. Once the message begins traveling
on express channels, latency increases linearly with slope $1 + \alpha/i$. This occurs at distance D = 24
in the figure. There is a periodic variation in delay around this asymptote due to the number of
local channels being traversed, $H_{local} = (i + 1)/2 + ((D - i/2 + 1/2) \bmod i)$.

The hierarchical express cube has a latency that is logarithmic for short distances and linear for
long distances. The latency of messages traveling a short distance, $D < \alpha$, is node limited and
increases logarithmically with distance, $T_D \approx (i - 1)\log_i D\, T_n$. This delay is within a factor of i - 1
of the best that can be achieved with radix i switches. Long distance messages have a latency
of $T_D \approx (1 + \alpha/i^l)T_wD$. If $i^l = \alpha$, this long distance latency is approximately twice the latency
of a dedicated Manhattan wire. In a hierarchical network, the interchange spacing, i, can be
made small, giving good performance for short distances, without compromising the delay of long
distance messages which depends on the ratio $\alpha/i^l$. In a flat network with a single parameter, i,
it is not possible to simultaneously optimize performance for both short and long distances.
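The asymptotic behaviour described above can be tabulated for the parameters quoted in the Figure 5 caption. The sketch below works in units of the adjacent-wire delay $T_w$, so that $T_n = \alpha$; the function names are hypothetical and the short-distance expression is the approximation from the text, not an exact hop count.

```python
import math


def long_distance_slope(alpha, i, l=1):
    """Slope of latency versus distance once express channels are in use: 1 + alpha / i**l."""
    return 1 + alpha / i ** l


def short_distance_latency(D, alpha, i):
    """Approximate node-limited latency (i - 1) * log_i(D) * Tn, in units of Tw."""
    return (i - 1) * math.log(D, i) * alpha


alpha = 64
print("conventional slope:", alpha)
print("flat, i = 16:", long_distance_slope(alpha, 16))
print("hierarchical, i = 4, l = 3:", long_distance_slope(alpha, 4, 3))
# Short distance D = 16: hierarchical approximation versus the conventional alpha * D.
print(short_distance_latency(16, alpha, 4), alpha * 16)
```

With these parameters the flat cube cannot match the hierarchical cube at both ends of the distance range, since its single spacing parameter fixes both the short-distance and long-distance behaviour.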


3.5  Express  Channels  in  Many  Dimensions 

A multidimensional express cube may be constructed by inserting interchanges into each dimension
separately as shown in Figure 6A. The figure shows part of a two-dimensional express cube
with i = 4, l = 1. Interchanges have been inserted separately into the X and Y dimensions. A
similar construction can be realized for higher dimensions and for hierarchical networks. With this
approach interchange pin-count is minimal as each interchange handles only a single dimension.
Also, the design is easy to package into modules as the interchanges are located in regular rows
and columns. This approach has the disadvantage that messages must descend to local channels
to switch dimensions.

Figure 7: Interchanges allow wire density, speed, and signalling levels to be changed at module
boundaries.

An  alternate  construction  of  a  multidimensional  express  cube  is  to  interleave  multidimensional 
interchanges  into  the  array  as  shown  in  Figure  6B  for  i  =  l  =  1.  This  approach  allows  messages 
on  express  channels  to  change  dimensions  without  descending  to  a  local  channel.  It  is  particularly 
useful in networks that use adaptive routing [13][14] as it provides alternate paths at each level of
the  network.  The  interleaved  construction  has  the  disadvantages  of  requiring  a  higher  interchange 
pincount  and  being  more  difficult  to  package  into  modules. 

3.6  Modularity 

The  interchanges  in  an  express  cube  can  be  used  to  change  wire  density,  speed,  and  signalling 
levels  at  module  boundaries  as  shown  in  Figure  7.  Large  networks  are  built  from  many  modules 
in  a  physical  hierarchy.  A  typical  hierarchy  includes  integrated  circuits,  printed  circuit  boards, 
chassis,  and  cabinets.  Available  wire  density  and  bandwidth  change  significantly  between  levels 
of  the  hierarchy.  For  example,  a  typical  integrated  circuit  has  a  wire  density  of  250  wires/mm  per 
layer  while  a  printed  circuit  board  can  handle  only  2  wires/mm  per  layer6.  Interchanges  placed  at 
module  boundaries  as  shown  in  Figure  7  can  be  used  to  vary  the  number  and  width  of  express  and 
local  channels.  These  boundary  interchanges  may  also  convert  internal  module  signalling  levels 
and  speeds  to  levels  and  speeds  more  appropriate  between  modules.  Using  express  channels  and 
boundary  interchanges,  the  network  can  be  adjusted  to  saturate  the  available  wiring  density  even 
though  this  density  is  not  uniform  across  the  packaging  hierarchy.  To  make  use  of  the  available 
bandwidth,  computations  running  on  the  network  must  exploit  locality. 

6This integrated circuit wire density is typical of first-level metal in a 1μ CMOS process. The printed circuit
wire density is for a board with 8mil wires and spaces. Both densities assume all area is available for wiring.
Figure 8: Block diagram of an interchange. Two multiplexors perform switching between input
and  output  registers  based  on  a  comparison  of  the  high  address  bits  in  a  message  header. 


4  Interchange  Design 

Figure  8  shows  the  block  diagram  of  a  unidirectional  interchange.  A  bidirectional  interchange 
includes  an  identical  circuit  in  the  opposite  direction.  The  basic  design  is  similar  to  that  of  a 
router  [15][6][3].  Two  input  latches  hold  arriving  flits  and  two  output  latches  hold  departing  flits. 
If  additional  buffering  is  desired,  any  of  these  latches  may  be  replaced  by  a  FIFO  buffer.  If  a  phit 
is  a  different  size  than  a  flit,  multiplexing  and  demultiplexing  is  required  between  the  flit  buffers 
and  the  interchange  pins.  Associated  with  each  output  latch  is  a  multiplexor  that  selects  which 
input  is  routed  to  the  latch.  Routing  decisions  are  made  by  comparing  the  address  information 
in the head flit(s) of the message to the local address. If the destination lies within the next i
nodes, the local channel is chosen, otherwise the express channel is chosen. If i is a power of two,
interchanges are aligned, and absolute addresses are used in headers, the comparison can be made
by checking all but the $l \log_2 i$ least significant bits for equality to the local address.
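The address comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the interchange logic itself: the function name is hypothetical, addresses are plain integers along one dimension, and `local_addr` is taken to be the address of the first node following the (aligned) interchange in the direction of travel.

```python
def route_at_interchange(dest_addr, local_addr, i, level=1):
    """Choose 'local' or 'express' by comparing the high-order address bits.

    All but the level * log2(i) least significant bits of the destination are
    compared for equality with the same bits of the local address.
    """
    log2_i = i.bit_length() - 1
    if i < 2 or (1 << log2_i) != i:
        raise ValueError("this shortcut requires i to be a power of two")
    shift = level * log2_i
    same_block = (dest_addr >> shift) == (local_addr >> shift)
    return "local" if same_block else "express"


# With i = 4 and an interchange placed just before node 8,
# destinations 8..11 stay local and anything further takes the express channel.
for dest in (9, 11, 12, 40):
    print(dest, route_at_interchange(dest, local_addr=8, i=4))
```

Because the test is a pure equality check on a bit field, it needs no arithmetic on the header, which is the reason the text prefers power-of-two spacings.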

The  interchange  state  includes  presence  bits  for  each  register,  an  input  state  for  each  input,  and 
an  output  state  for  each  output.  The  presence  bits  are  used  for  flit-level  flow  control.  A  flit  is 
allowed  to  advance  only  if  the  presence  bit  of  its  destination  register  is  clear  (no  data  present),  or 
if  the  register  is  to  be  emptied  in  the  same  cycle.  The  input  state  bits  hold  the  destination  port 
and  status  (empty,  head,  advancing,  blocked)  of  the  message  currently  using  each  input.  The 
output  state  consists  of  a  bit  to  identify  whether  the  output  is  busy  and  a  second  bit  to  identify 
which input has been granted the output. The combinational logic to maintain these state bits
and  control  the  data  path  is  straightforward. 
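The presence-bit rule for flit-level flow control can be modeled with a small sketch. It is a simplified software model, assuming a single chain of one-flit registers feeding a sink; the function names and the `None`-means-empty convention are hypothetical and stand in for the presence bits.

```python
def may_advance(dest_present, dest_emptying):
    """A flit may advance if the destination register is empty or is emptied this cycle."""
    return (not dest_present) or dest_emptying


def step(registers, sink_ready, incoming=None):
    """Advance a chain of one-flit registers by one cycle.

    registers: non-empty list where None marks an empty register (presence bit clear).
    Returns (new_registers, delivered_flit_or_None, incoming_accepted).
    """
    n = len(registers)
    leaves = [False] * n
    # Work backward from the output so "emptied in the same cycle" is known upstream.
    for idx in range(n - 1, -1, -1):
        if registers[idx] is None:
            continue
        if idx == n - 1:
            leaves[idx] = sink_ready
        else:
            leaves[idx] = may_advance(registers[idx + 1] is not None, leaves[idx + 1])
    new = [None] * n
    for idx in range(n):
        if registers[idx] is not None and not leaves[idx]:
            new[idx] = registers[idx]          # flit is blocked and stays put
    for idx in range(n - 1):
        if leaves[idx]:
            new[idx + 1] = registers[idx]      # flit moves one register forward
    delivered = registers[-1] if leaves[-1] else None
    accepted = incoming is not None and may_advance(registers[0] is not None, leaves[0])
    if accepted:
        new[0] = incoming
    return new, delivered, accepted


print(step(["a", "b"], sink_ready=True, incoming="c"))
print(step(["a", "b"], sink_ready=False, incoming="c"))
```

When the sink accepts a flit, every register can shift in the same cycle; when it does not, the whole chain holds and the incoming flit is refused.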
5 Conclusion


Express cubes are k-ary n-cubes augmented by express channels that provide a short path for
non-local messages. An express cube retains the wire efficiency of a conventional k-ary n-cube while
providing improved latency and throughput that are limited only by the wire delay and available
wire density. For short distances, a hierarchical express cube has a latency that is within a small
factor of the best that can be achieved with a bounded degree network. For long distances, the
latency can be made arbitrarily close to that of a dedicated Manhattan wire. Multiple express
channels can be used to increase throughput to the limit of the available wire density. The express
cube combines the low diameter of multistage interconnection networks with the wire efficiency
and ability to exploit locality of a direct network. The result is a network with latency and
throughput that are within a small factor of the physical limit.

Express channels are added to a k-ary n-cube by periodically inserting interchanges into each
dimension. No modifications are required to the routers in each processing node; express channels
can be added to most existing k-ary n-cube networks. Interchanges also allow wire density, speed,
and  signalling  levels  to  be  changed  at  module  boundaries.  An  express  cube  can  make  use  of  all 
available  wire  density  even  if  the  wire  density  is  non-uniform.  This  is  often  required  as  the  wire 
density  and  speed  may  change  significantly  between  levels  of  packaging. 

Express cubes achieve their performance at the cost of adding interchanges, increasing the latency
for some short-distance messages, and increasing the bisection width of the network. Each
interchange adds a component to the system and increases the latency of local messages that cross an
interchange but do not take the express channel by one node delay, $(T_n + T_w)$. Express channels
increase the wire bisection by using available unused wiring capacity. In parts of the network that
are already wire-limited the express and local channels can be combined as shown in Figure 7.

As  the  performance  of  interconnection  networks  approaches  the  limits  of  the  underlying  wiring 
media  their  range  of  application  increases.  These  networks  can  go  beyond  exchanging  messages 
between  the  nodes  of  concurrent  computers  to  serving  as  a  general  interconnection  media  for 
digital electronic systems. For distances larger than $D' = \alpha i \log_i \alpha$, the delay of a hierarchical
express cube network is within a factor of three of that of a dedicated wire. The network may
provide better performance than the wire because it is able to share its wiring resources among
many paths in the network while a dedicated wire serves only a single source and destination.
For distances smaller than $D'$, dedicated wiring offers a significant latency advantage at the cost
of  eliminating  resource  sharing. 
Abstract

This paper explores priority arbitration schemes that employ busses to arbitrate
among n modules in a digital system. We focus on distributed mechanisms that
employ m busses, for $\lg n \le m \le n$, and use asynchronous combinational arbitration
logic. A widely used distributed asynchronous mechanism is the binary arbitration
scheme, which with $m = \lg n$ busses arbitrates in $t = \lg n$ units of time. We present
a new asynchronous scheme, binomial arbitration, that by using $m = \lg n + 1$
busses reduces the arbitration time to $t = \frac{1}{2} \lg n$. Extending this result, we present
the generalized binomial arbitration scheme that achieves a bus-time tradeoff of the
form $m = \Theta(t\,n^{1/t})$ between the number of arbitration busses m and the arbitration
time t (in units of bus-settling delay), for values of $1 \le t \le \lg n$ and $\lg n \le m \le n$.
Our schemes are based on a novel analysis of data-dependent delays and generalize
the two known schemes: linear arbitration, which with $m = n$ busses achieves $t = 1$
time, and binary arbitration, which with $m = \lg n$ busses achieves $t = \lg n$ time. Most
importantly, our schemes can be adopted with no changes to existing hardware and
protocols; they merely involve selecting a good set of priority arbitration codewords.

Keywords: arbitration, arbitration priorities, asynchronous arbitration, binary
arbitration, binomial arbitration, busses, bus-settling delay, combinational logic,
data-dependent delays, generalized binomial arbitration, linear arbitration, open-collector
busses, priority arbitration, resource tradeoff, wired-OR.
1  Introduction


In many electronic systems there are situations where several modules wish to use a common
resource simultaneously. Examples include microprocessor systems where a decision is
required  concerning  which  of  several  interrupts  to  service  first,  multiprocessor  environments 
where  several  processors  wish  to  use  some  device  concurrently,  and  data  communication 
networks  with  shared  media.  To  resolve  conflicts,  an  arbitration  mechanism  is  required 
that  grants  the  resource  to  one  module  at  a  time. 

Numerous arbitration mechanisms have been developed, including daisy chains, priority
circuits, polling, token passing, and carrier sense protocols, to name a few (see [5, 6, 10,
14, 18, 19, 22, 26]). In this paper we focus on distributed priority arbitration mechanisms,
where  contention  is  resolved  using  predetermined  module  priorities  and  the  arbitration 
process  is  carried  out  in  a  distributed  manner  at  all  the  system  modules.  In  many  modern 
systems,  and  especially  in  multiprocessor  environments  and  data  communication  networks, 
distributed  priority  arbitration  is  the  preferred  mechanism. 

Many  distributed  arbitration  mechanisms  employ  a  collection  of  arbitration  busses  to 
implement priority arbitration. To this end, each module is assigned a unique arbitration
priority,  which  is  an  encoding  of  its  name.  An  arbitration  protocol  determines  the  logic 
values  that  a  module  applies  to  the  busses,  based  on  the  module's  arbitration  priority 
and  on  logic  values  on  other  busses.  After  some  delay,  the  settled  logic  values  on  the 
busses  uniquely  identify  the  contending  module  with  the  highest  priority.  In  particular, 
the  asynchronous  binary  arbitration  scheme,  developed  by  Taub  [23],  gained  popularity 
and is used in many modern bus systems, such as Futurebus [7, 25], M3-bus [9], S-100
bus [13, 24], Multibus-II [14], Fastbus [15], and Nubus [28]. Other priority arbitration
mechanisms that employ busses are described in [5, 6, 10, 12, 17, 18, 19, 22, 26].

The asynchronous binary arbitration scheme arbitrates among n modules in $t = \lg n$
units of time, using $m = \lg n$ open-collector (wired-OR) arbitration busses.¹ The technology
of open-collector busses is such that the default logic value on a bus is 0, unless at least
one module applies a 1 to it, in which case it becomes a 1. Open-collector busses thus OR
together the logic values applied to them, with some time delay called bus-settling delay.
In asynchronous binary arbitration, each module is assigned a unique $(\lg n)$-bit arbitration
priority. When arbitration begins, competing modules apply their arbitration priorities to
the $m = \lg n$ busses, each bit on a separate bus; the result is the bitwise OR of their
arbitration priorities. As arbitration progresses, each competing module monitors the busses
and disables its drivers according to the following rule: if the module is applying a 0 (that
is, not applying a 1) to a particular bus but detects that the bus is carrying a 1 (applied by
some other module), it ceases to apply all its bits of lower significance. Disabled bits are
re-enabled should the condition cease to hold. The effect of this rule is that the arbitration
proceeds in $\lg n$ stages from the most significant bit to the least significant bit. Each stage
consists of resolving another bit of the highest competing binary priority, which leads to a
worst-case arbitration time of $t = \lg n$ (in units of bus-settling delay).

¹ Throughout this paper we count only arbitration busses that are used for encoding the priorities.
Several additional control busses are used by all schemes and are therefore not counted.
Figure 1: Asynchronous binary arbitration process with 4 busses. The competing modules are
$c_2$, $c_5$, $c_9$, and $c_{10}$, with corresponding arbitration priorities 0010, 0101, 1001, and 1010. Bits in
shaded regions are not applied to the busses. The process takes 4 stages.


For example, consider a system of n = 16 modules that uses $m = \lg 16 = 4$ arbitration
busses, with the 16 arbitration priorities consisting of all the 4-bit codewords {0000, 0001,
0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111}.
Figure 1 outlines an asynchronous binary arbitration process among four such modules $c_2$,
$c_5$, $c_9$, and $c_{10}$, with corresponding arbitration priorities 0010, 0101, 1001, and 1010. The
arbitration process begins by bitwise ORing the four arbitration priorities. After one unit
of bus-settling delay (stage 1), bus $b_3$ settles to the value 1, where it will remain for the
duration of the arbitration. By the above rule, each of modules $c_2$ and $c_5$ disables its last
three bits. In the meantime, however, each of modules $c_9$ and $c_{10}$ disables its last two bits,
because of the 1 on bus $b_2$. At the end of stage 2, bus $b_2$ settles to the value 0, where it will
remain for the rest of the process. As a result, modules $c_9$ and $c_{10}$ now re-enable their
low-order bits (stage 3), which results in bus $b_1$ settling to a 1 at the end of stage 3. Finally, in
stage 4, module $c_9$ ceases to apply its last bit, because of the 1 it detects on bus $b_1$, which
results in bus $b_0$ settling to a 0 at the end of stage 4. This arbitration process required
$t = \lg 16 = 4$ stages to complete.
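
The stage-by-stage behaviour of this example can be replayed with a short simulation. The following is a minimal sketch, not part of the original text: it assumes the idealized model in which every bus is re-evaluated once per bus-settling delay from the values observed at the end of the previous stage, and the helper names `drive` and `binary_arbitration` are hypothetical.

```python
def drive(p, v):
    """Bits that a module with priority p applies when it observes bus value v."""
    blocked = ~p & v                  # busses carrying a 1 where p holds a 0
    if blocked == 0:
        return p                      # no bit is disabled
    h = blocked.bit_length() - 1      # most significant such bus
    return (p >> h) << h              # all bits of lower significance are disabled


def binary_arbitration(priorities):
    """Return the bus values at the end of each stage, until they stop changing."""
    history = []
    v = 0                             # all busses start at logic value 0
    while True:
        nxt = 0
        for p in priorities:
            nxt |= drive(p, v)        # wired-OR of everything currently applied
        if nxt == v:
            return history
        history.append(nxt)
        v = nxt


# Modules c2, c5, c9, and c10 of the example above.
stages = binary_arbitration([0b0010, 0b0101, 0b1001, 0b1010])
print([format(s, "04b") for s in stages])   # ['1111', '1000', '1011', '1010']
```

The four recorded bus states correspond to the four stages described in the example, and the final state 1010 is the priority of the winning module.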

In this paper we show that the asynchronous binary arbitration scheme can in fact
be improved. We introduce the new asynchronous binomial arbitration scheme, which uses
one more arbitration bus in addition to the $\lg n$ busses of binary arbitration, but, most
surprisingly, reduces the arbitration time to $\frac{1}{2} \lg n$. In asynchronous binomial arbitration,
we use $(\lg n + 1)$-bit codewords as arbitration priorities and follow the same arbitration
protocol of asynchronous binary arbitration. Our binomial arbitration scheme guarantees
fast arbitration by employing certain codewords that exhibit small data-dependent delays
during arbitration processes. For example, by using the following set of 5-bit codewords

{00000, 00001, 00010, 00011, 00100, 00110, 00111, 01000, 01100, 01110, 01111, 10000,
11000, 11100, 11110, 11111}

as arbitration priorities, we can arbitrate among 16 modules
using 5 busses in at most 2 stages. Figure 2 outlines an asynchronous binomial arbitration
process among four such modules $c_1$, $c_6$, $c_{11}$, and $c_{12}$, with corresponding arbitration
priorities 00001, 00111, 10000, and 11000 from the above set, that completes in 2 stages. It turns
out that for any subset of the above 16 codewords, the corresponding arbitration process
takes at most 2 stages. In Section 3, we show how to design a good set of codewords for
general values of n by using binomial codes as arbitration priorities.
Figure 2: Asynchronous binomial arbitration process with 5 busses. The competing modules
are $c_1$, $c_6$, $c_{11}$, and $c_{12}$, with corresponding arbitration priorities 00001, 00111, 10000, and 11000.
Bits in shaded regions are not applied to the busses. The process takes 2 stages.
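
The claim that every subset of these 16 codewords settles within 2 stages can be checked exhaustively. The sketch below is an illustration added here, under the same idealized stage model as the formal definitions later in the paper (all busses start at 0 and are re-evaluated once per bus-settling delay); the function names are hypothetical.

```python
CODEWORDS = [int(w, 2) for w in (
    "00000 00001 00010 00011 00100 00110 00111 01000 "
    "01100 01110 01111 10000 11000 11100 11110 11111"
).split()]


def drive(p, v):
    """Value applied by a module with priority p that observes bus state v."""
    blocked = ~p & v                  # busses at 1 where p holds a 0
    if blocked == 0:
        return p
    h = blocked.bit_length() - 1      # most significant blocking bus
    return (p >> h) << h              # bits below bus h are disabled


def arbitrate(priorities):
    """Return (number of stages, settled bus value), starting from all zeros."""
    v, t = 0, 0
    while True:
        nxt = 0
        for p in priorities:
            nxt |= drive(p, v)
        if nxt == v:
            return t, v
        v, t = nxt, t + 1


def worst_case_stages(codewords):
    """Maximum stage count over every non-empty subset of contenders."""
    worst = 0
    for mask in range(1, 1 << len(codewords)):
        subset = [c for i, c in enumerate(codewords) if (mask >> i) & 1]
        t, settled = arbitrate(subset)
        assert settled == max(subset)     # the highest priority must win
        worst = max(worst, t)
    return worst


worst = worst_case_stages(CODEWORDS)
print(worst)                              # 2
```

The loop visits all 65,535 non-empty subsets, so it takes a few seconds in pure Python; it is meant as a sanity check of the example, not as a proof for general n.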

The  remainder  of  this  paper  explores  priority  arbitration  schemes  that  employ  busses  to 
arbitrate  among  n  modules.  In  Section  2  we  discuss  priority  arbitration  and  formally  define 
the  asynchronous  model  of  priority  arbitration  with  busses.  Section  3  describes  the  two 
known  asynchronous  schemes:  linear  arbitration  and  binary  arbitration,  and  presents  our 
new asynchronous binomial arbitration scheme, which with $m = \lg n + 1$ busses arbitrates
in $t = \frac{1}{2} \lg n$ units of time. In Section 4 we extend binomial arbitration and present the
generalized binomial arbitration scheme that achieves a spectrum of bus-time tradeoffs of
the form $m = \Theta(t\,n^{1/t})$ between the number of arbitration busses m and the arbitration
time t, for values of $1 \le t \le \lg n$ and $\lg n \le m \le n$. The established bus-time tradeoff is of
great  practical  interest,  enabling  system  designers  to  achieve  a  desirable  balance  between 
amount  of  hardware  and  speed.  We  present  a  variety  of  extensions  to  the  results  of  this 
paper  in  Section  5. 


2  Asynchronous  Priority  Arbitration  with  Busses 

In  this  section  we  discuss  priority  arbitration  and  formally  define  the  asynchronous  model 
of priority arbitration with busses. The definitions in this section model typical
implementations of asynchronous priority arbitration mechanisms that employ busses.

Arbitration is the process of selecting one module from a set of contending modules. In
asynchronous priority arbitration with busses, each module is assigned a unique arbitration
priority, an encoding of its name, which is used in determining logic values to apply
to the busses during arbitration. An arbitration protocol determines the logic values that
a  competing  module  applies  to  the  busses  based  on  the  module's  arbitration  priority  and 
potentially  also  on  logic  values  on  other  busses.  The  beginning  of  an  arbitration  process 
is  identified  by  a  system-wide  signal,  usually  called  REQUEST  or  ARBITRATE.  The 
resolution  of  an  arbitration  process  is  the  collection  of  settled  logic  values  on  the  busses  at 
the  end  of  the  process,  which  should  uniquely  identify  the  competing  module  having  the 
highest  arbitration  priority. 




Throughout this paper we use the following notations and assumptions. The set
$C = \{c_0, c_1, \ldots, c_{n-1}\}$ denotes the n system modules (chips), which we assume are indexed
in increasing order of priority. The m open-collector (wired-OR) arbitration busses are
denoted by $B = \{b_0, b_1, \ldots, b_{m-1}\}$, where the busses are indexed in increasing order of
significance (to be elaborated later). The set $P = \{p_0, p_1, \ldots, p_{n-1}\}$ consists of n distinct
arbitration priorities, with $p_i$ being the arbitration priority of module $c_i$. Arbitration
priorities are only a convenient mechanism of encoding the modules' names, and in many
asynchronous schemes arbitration priorities are m-bit vectors that competing modules
apply to the m busses during arbitration. When necessary, we denote the bits of an
arbitration priority p by $p^{(0)}, p^{(1)}, p^{(2)}, \ldots$ in order of increasing significance. We assume
that each module is connected to all busses and can thus read from and potentially write
to any bus. All modules follow the same arbitration protocol in interfacing with the busses
and reaching conclusions concerning the arbitration process. Finally, we assume that only
competing modules apply logic values to the busses; noncompeting modules do not interfere
with the busses. All our assumptions are standard design practice in many systems.

In  asynchronous  priority  arbitration  with  busses,  we  restrict  the  arbitration  process 
to  be  purely  combinational  by  requiring  that  the  arbitration  logic  on  all  the  modules 
together  with  the  arbitration  busses  form  an  acyclic  circuit.  Using  combinational  logic  with 
asynchronous  feedback  paths  may  introduce  race  conditions  and  metastable  states,  which 
can defer arbitration indefinitely (see [1, 20, 21]). The acyclic nature of the arbitration
logic  imposes  a  partial  order  on  the  busses,  which  can  be  extended  to  a  linear  order.  The 
significance  of  the  linear  order  on  the  busses  is  that  logic  values  on  higher  indexed  busses 
can  be  used  to  determine  logic  values  of  lower  indexed  busses  but  not  vice  versa.  We 
formalize  this  idea  in  the  following  definition  of  an  acyclic  arbitration  protocol. 

Definition 1 Let P be a set of arbitration priorities. An acyclic arbitration protocol of size
m for P is a sequence $F = (f_{m-1}, \ldots, f_1, f_0)$ of m functions, $f_j : P \times \{0,1\}^{m-j-1} \to \{0,1\}$,
for $j = 0, 1, \ldots, m-1$.

In  asynchronous  priority  arbitration  with  busses,  every  module  has  arbitration  circuitry 
that  implements  the  same  acyclic  arbitration  protocol,  but  with  the  module's  arbitration 
priority as a parameter. The m arbitration busses are ordered from $b_{m-1}$ down to $b_0$
in accordance with the acyclic nature of the circuit. Informally, function $f_j$ takes an
arbitration priority $p \in P$ and the $m-j-1$ bit values on the highest $m-j-1$ busses $b_{m-1}$
through $b_{j+1}$, and determines the bit value that a competing module c with arbitration
priority p applies to bus $b_j$, for $j = 0, 1, \ldots, m-1$. An arbitration process among several

contending  modules  consists  of  the  competing  modules  applying  logic  values  to  the  m 
busses  according  to  the  acyclic  arbitration  protocol  of  size  m. 

Measuring  the  arbitration  time  of  asynchronous  mechanisms  is  somewhat  problematic. 
We follow a standard approach taken in many bus systems (see [6, 10, 11, 14, 16, 24, 25])
and measure the arbitration time in units of bus-settling delay. Bus-settling delay, $T_{bus}$, is
the time it takes for a bus to settle to a stable logic value, once its drivers have stabilized,
which includes the delays introduced by the logic gates driving the bus, the bus propagation
delay, and any additional time required to resolve transient effects such as the wired-OR
glitch. In effect, we model an open-collector bus as an OR gate with delay $T_{bus}$, the time
it takes for the output of the gate to stabilize on a valid logic value, once its inputs have
reached their final values. An arbitration process is modeled as a sequence of stages, each
taking $T_{bus}$ time, and the arbitration time is defined as the number of stages it takes
until all busses stabilize. This approach models the situation in many bus systems rather
accurately. (More discussion of measuring the arbitration time in units of bus-settling
delay is deferred until Section 5.)

We next formally define the notion of an arbitration process of an acyclic arbitration
protocol on a set of competing arbitration priorities. We characterize the arbitration
process by the collection of the logic values on the m busses at the end of each computation
stage. We use $v_j[l]$ to denote the logic value on bus $b_j$ at the end of the lth computation
stage, for $j = 0, 1, \ldots, m-1$ and $l = 0, 1, \ldots$. Without loss of generality, we assume that
an arbitration process begins with all busses being in logic value 0.

Definition 2 Let P be a set of arbitration priorities, F be an acyclic arbitration protocol
of size m for P, and $Q \subseteq P$ be a set of competing arbitration priorities. The arbitration
process of F on Q is the successive evaluation of

$$v_j[0] = 0,$$

$$v_j[l+1] = \bigvee_{p \in Q} f_j\bigl(p, v_{m-1}[l], \ldots, v_{j+1}[l]\bigr),$$

for $j = 0, 1, \ldots, m-1$ and $l = 0, 1, \ldots$. We say that the arbitration process takes t stages
if $t \ge 0$ is the smallest integer for which $v_j[t] = v_j[t+1]$, for $j = 0, 1, \ldots, m-1$. The
resolution of the arbitration process is the sequence of values $(v_{m-1}[t], \ldots, v_1[t], v_0[t])$.

Definition  2  characterizes  an  arbitration  process  as  a  successive  application  of  the 
acyclic arbitration protocol F to the set of competing arbitration priorities Q and the
current state of the m busses. The arbitration process terminates when no more changes
in the state of the busses occur, at which point a resolution is reached. It is relatively easy
to verify that any arbitration process of an acyclic arbitration protocol F of size m takes
at most m stages. This is the case because at each computation stage of an arbitration
process,  at  least  one  more  bus  stabilizes  on  its  final  value. 
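
Definitions 1 and 2 can be read almost directly as a program. The following sketch is an illustration added here, not code from the paper: a protocol is assumed to be a Python list `F` with `F[j]` playing the role of $f_j$, a priority is a tuple of bits indexed by significance, and the binary-arbitration rule from the introduction is used as the example protocol. The names `arbitration_process`, `binary_protocol`, and `bits` are hypothetical.

```python
from itertools import combinations


def arbitration_process(F, Q, m):
    """Evaluate Definition 2; return (t, resolution) with resolution (v_{m-1}, ..., v_0)."""
    v = [0] * m                                   # v[j] is the value on bus b_j
    t = 0
    while True:
        nxt = []
        for j in range(m):
            higher = tuple(v[i] for i in range(m - 1, j, -1))   # (v_{m-1}, ..., v_{j+1})
            nxt.append(int(any(F[j](p, higher) for p in Q)))    # wired-OR over Q
        if nxt == v:
            return t, tuple(reversed(v))
        v, t = nxt, t + 1


def binary_protocol(m):
    """The binary-arbitration rule, written as m functions f_j(p, higher)."""
    def make(j):
        def f(p, higher):
            for k, value in enumerate(higher):    # higher[k] is bus b_{m-1-k}
                if p[m - 1 - k] == 0 and value == 1:
                    return 0                      # bit j is disabled
            return p[j]
        return f
    return [make(j) for j in range(m)]


def bits(x, m):
    """Priority x as a tuple (p^(0), ..., p^(m-1)), least significant bit first."""
    return tuple((x >> i) & 1 for i in range(m))


m = 3
F = binary_protocol(m)
longest = 0
for size in range(1, 2 ** m + 1):
    for ints in combinations(range(2 ** m), size):
        t, resolution = arbitration_process(F, [bits(x, m) for x in ints], m)
        assert resolution == bits(max(ints), m)[::-1]   # resolution names the winner
        longest = max(longest, t)
print(longest)                                    # 3
```

For this small case, no subset of 3-bit priorities needs more than m = 3 stages, and some subsets (for example priorities 2, 4, and 5) need exactly 3.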

A better upper bound for the number of stages taken by arbitration processes is given
by the depth of the acyclic arbitration protocol. As discussed above, the acyclic nature
of the arbitration logic imposes a partial order on the busses. We can therefore statically
partition the m busses into d levels, such that the computation for a bus in a certain
level only uses the values of busses in previous levels. More formally, given an acyclic
arbitration protocol F of size m, we simultaneously partition the m functions of F into d
nonempty disjoint sets $F_0, F_1, \ldots, F_{d-1}$, and the m busses of B into d corresponding sets
$B_0, B_1, \ldots, B_{d-1}$, with $f_j \in F_h$ if and only if $b_j \in B_h$, for $0 \le j \le m-1$ and $0 \le h \le d-1$.
The partition must have the property that the computation of a function $f_j \in F_h$ depends
only on the arbitration priorities and on values of busses in sets $B_0, B_1, \ldots, B_{h-1}$. The
depth of an acyclic arbitration protocol F of size m is defined as the smallest d for which
a partition as above exists. The depth of an acyclic arbitration protocol is never greater
than its size. The next theorem shows that any acyclic arbitration protocol of depth d
reaches a resolution after at most t = d computation stages.

Theorem 1 Let P be a set of arbitration priorities, F be an acyclic arbitration protocol
of size m for P, and d be the depth of F. Then, for any subset $Q \subseteq P$ of competing
arbitration priorities, the arbitration process of F on Q takes at most d stages.

Proof. By induction on d, the depth of the acyclic arbitration protocol F.

Base case: d = 0. For depth d = 0, there are no arbitration busses and the claim holds
immediately for arbitrary Q.

Inductive case: d > 0. Given an acyclic arbitration protocol $F = (f_{m-1}, \ldots, f_1, f_0)$ of
size m and depth d for P, we can partition $F = \bigcup_{h=0}^{d-1} F_h$ and $B = \bigcup_{h=0}^{d-1} B_h$ as above. Without
loss of generality, we assume that the last level consists of the r functions and busses with
indices $0, 1, \ldots, r-1$. The first $d-1$ levels of F constitute an acyclic arbitration protocol
$F' = \bigcup_{h=0}^{d-2} F_h = (f_{m-1}, \ldots, f_{r+1}, f_r)$ of size $m-r$ and depth $d-1$ for P. By induction, the
arbitration process of $F'$ on Q takes at most $d-1$ stages. That is, for any $r \le j \le m-1$
and $l \ge d-1$, we have $v_j[l] = v_j[d-1]$. In addition, according to the acyclic arbitration
protocol F, we also have that for any $0 \le i \le r-1$ and $k \ge d > 0$,

$$
\begin{aligned}
v_i[k] &= \bigvee_{p \in Q} f_i\bigl(p, v_{m-1}[k-1], \ldots, v_r[k-1]\bigr) \\
       &= \bigvee_{p \in Q} f_i\bigl(p, v_{m-1}[d-1], \ldots, v_r[d-1]\bigr) \\
       &= v_i[d],
\end{aligned}
$$

because the dth level depends only on busses $b_{m-1}$ down to $b_r$ and because $k-1 \ge d-1$.
This proves that the arbitration process takes at most d stages. ∎

Theorem  1  shows  that  the  number  of  stages  that  an  arbitration  process  takes  is  bounded 
by  the  depth  of  the  acyclic  arbitration  protocol  F.  This  bound  represents  a  standard  static 
approach  in  the  analysis  of  delays  in  digital  circuits,  namely,  that  of  counting  the  number 
of  gates  on  the  longest  path  from  the  inputs  to  the  outputs.  In  this  paper,  however,  we 
introduce  and  use  a  novel  dynamic  approach  of  bounding  the  number  of  stages  that  an 
arbitration process takes by a careful analysis of the data-dependent delays experienced
in the arbitration circuits. In doing so, we exhibit arbitration schemes that guarantee
termination of any arbitration process in a circuit of size and depth m after a fixed number
of stages t, for values of 0 < t < m.
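
The static bound of Theorem 1 depends only on which busses each function reads, so the depth can be computed without simulating anything. The sketch below is an added illustration under that reading; the representation `deps[j]` (the set of bus indices whose values $f_j$ uses) and the name `protocol_depth` are assumptions made for the example.

```python
def protocol_depth(deps):
    """Depth of an acyclic protocol, given deps[j] = indices of busses read by f_j."""
    m = len(deps)
    level = [0] * m
    for j in range(m - 1, -1, -1):            # higher-indexed busses are settled first
        if any(i <= j for i in deps[j]):
            raise ValueError("f_j may only read busses with a higher index")
        # A bus sits one level after the deepest bus it depends on.
        level[j] = 1 + max((level[i] for i in deps[j]), default=0)
    return max(level, default=0)


# Linear arbitration: no function reads any bus, so the depth is 1.
print(protocol_depth([set() for _ in range(16)]))                    # 1
# Binary arbitration on 4 busses: f_j reads every higher bus, so the depth is 4.
print(protocol_depth([set(range(j + 1, 4)) for j in range(4)]))      # 4
```

The dynamic analysis used in the rest of the paper goes further than this count: the binomial scheme has the same dependency structure (and hence the same depth) as binary arbitration, yet settles in fewer stages because of the codewords chosen.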

To  complete  the  definition  of  asynchronous  priority  arbitration  schemes,  we  need  to 
introduce  the  notion  of  an  interpretation  function.  Suppose  we  have  a  set  of  arbitration 
priorities  P  and  an  acyclic  arbitration  protocol  F  of  size  m  for  P.  An  interpretation 
function for P and F is a function $\mathrm{WIN} : \{0,1\}^m \to P$, such that for any $Q \subseteq P$, with
$p \in Q$ being the highest arbitration priority in Q and $(v_{m-1}, \ldots, v_1, v_0)$ being the resolution
of the arbitration process of F on Q, we have $\mathrm{WIN}(v_{m-1}, \ldots, v_1, v_0) = p$. Informally,
WIN interprets the resolution of any arbitration process of F by identifying the highest
competing  arbitration  priority.  We  are  now  ready  to  define  an  asynchronous  priority 
arbitration  scheme  for  n  modules,  m  busses,  and  t  stages. 
Definition 3 An asynchronous priority arbitration scheme for n modules, m busses, and
t stages is a triplet $A(n, m, t) = (P, F, \mathrm{WIN})$, where

• P is a set of n arbitration priorities;

• F is an acyclic arbitration protocol of size m for P;

• WIN is an interpretation function for P and F;

such that for any $Q \subseteq P$, the arbitration process of F on Q takes at most t stages.

Definition 3 emphasizes the role of the arbitration priorities, which are just a mechanism
to distinguish between different modules. It will become apparent, however, that careful
design of the codewords used as arbitration priorities has a significant impact on the
arbitration time. In the next section, for example, we demonstrate that by using the set
of $(\lg n + 1)$-bit binomial codes as arbitration priorities, we can achieve an arbitration time
of $t = \frac{1}{2} \lg n$.

3  Asynchronous  Priority  Arbitration  Schemes 

In  this  section  we  first  use  our  framework  to  describe  two  commonly  used  asynchronous 
priority arbitration schemes: linear arbitration, which with $m = n$ busses arbitrates in time
$t = 1$, and binary arbitration, which with $m = \lg n$ busses arbitrates in time $t = \lg n$. We
then present our new asynchronous scheme, binomial arbitration, which with $m = \lg n + 1$
busses arbitrates in time $t = \frac{1}{2} \lg n$.

The  Asynchronous  Linear  Arbitration  Scheme 

This scheme uses $m = n$ busses and arbitrates among n modules in $t = 1$ stages. To
arbitrate, contending module $c_i$ applies a 1 to bus $b_i$, for $0 \le i \le n-1$, and does not
interfere with other busses. This translates to module $c_i$ having an n-bit arbitration priority
$p_i$, such that $p_i^{(j)} = 1$ if $i = j$ and $p_i^{(j)} = 0$ otherwise. After $t = 1$ units of time, all the
busses stabilize on their final values, and the module with a 1 on the bus with the highest
priority is recognized as the winner. This scheme can also be implemented with tri-state
busses, since at most one module writes to any given bus. The scheme is also known
as decoded arbitration and is used in a number of bus systems and interrupt arbitration
mechanisms (see [10, 12, 18, 26]).

Formally, we define this scheme as $\mathrm{LINEAR}(n, n, 1) = (P, F, \mathrm{WIN})$, where

• $P = \{p_i = 0^{n-i-1}\, 1\, 0^{i} : \text{for } i = 0, 1, \ldots, n-1\}$.

• $F = (f_{n-1}, \ldots, f_1, f_0)$, where $f_j(p, v_{n-1}, \ldots, v_{j+1}) = p^{(j)}$, for $j = 0, 1, \ldots, n-1$.

• $\mathrm{WIN}(0^k\, 1\, \alpha) = 0^k\, 1\, 0^{n-k-1} = p_{n-k-1}$, for $0 \le k \le n-1$ and any $\alpha \in \{0,1\}^{n-k-1}$.
Notice that although the size of the acyclic arbitration protocol of LINEAR is $m = n$,
its depth is only $d = 1$, which according to Theorem 1 shows that the asynchronous linear
arbitration scheme takes at most $t = 1$ stages to arbitrate.
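
As a concrete reading of the LINEAR scheme, the sketch below (an added illustration with hypothetical function names) builds the one-hot priorities, forms the wired-OR that the busses settle to after a single stage, and applies the interpretation function by locating the most significant bus that carries a 1.

```python
def linear_priorities(n):
    """p_i has a single 1 in bit position i (one bus per module)."""
    return [1 << i for i in range(n)]


def settle(contending):
    """Bus values after one stage: the wired-OR of all applied priorities."""
    bus = 0
    for p in contending:
        bus |= p
    return bus


def win(bus):
    """Interpretation function: the priority whose single 1 is on the highest active bus."""
    if bus == 0:
        raise ValueError("no module is contending")
    return 1 << (bus.bit_length() - 1)


P = linear_priorities(8)
bus = settle([P[1], P[4], P[6]])
print(format(bus, "08b"))        # 01010010
print(win(bus) == P[6])          # True
```

Because no module ever looks at the busses before driving them, a second evaluation gives the same bus values, which is why one stage suffices.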


The  Asynchronous  Binary  Arbitration  Scheme 

This scheme uses $m = \lceil \lg n \rceil$ busses and arbitrates among n modules in $t = \lceil \lg n \rceil$ stages.
The arbitration priority $p_i$ of module $c_i$ is the binary representation of i, for $0 \le i \le n-1$.
To arbitrate, contending module c drives its binary priority p onto the m busses, from
$p^{(m-1)}$ (the most significant bit of p) onto bus $b_{m-1}$, down to $p^{(0)}$ (the least significant bit
of p) onto bus $b_0$; the result is the bitwise OR of the binary priorities of the competing
modules. During arbitration, each competing module c monitors the busses and disables
its drivers according to the following rule: let $p^{(l)}$ be the lth bit of the binary priority p,
and let $v_l$ be the binary value observed on bus $b_l$, for $0 \le l \le m-1$. Then if $p^{(l)} = 0$ and
$v_l = 1$, module c disables all its bits $p^{(j)}$ for $j < l$. Disabled bits are re-enabled should the
condition cease to hold. After $t = \lceil \lg n \rceil$ units of time, all the busses stabilize on their final
values, and the module whose arbitration priority appears on the busses is the winner.
This scheme was developed by Taub [23], and is also known as encoded arbitration (see
[6, 10, 14, 24, 25]).

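Formally, we define this scheme $\mathrm{BINARY}(n, \lceil \lg n \rceil, \lceil \lg n \rceil) = (P, F, \mathrm{WIN})$ as follows. For
simplicity of notation we use $m = \lceil \lg n \rceil$.

• $P = \{p_i = e_{m-1} \cdots e_1 e_0 : \text{where } e_{m-1} \cdots e_1 e_0 \text{ is the binary representation of } i, \text{ for } i = 0, 1, \ldots, n-1\}$.

• $F = (f_{m-1}, \ldots, f_1, f_0)$, where

$$
f_j(p, v_{m-1}, \ldots, v_{j+1}) =
\begin{cases}
0 & \text{if } \bigvee_{i=j+1}^{m-1} \bigl(p^{(i)} = 0 \wedge v_i = 1\bigr), \\
p^{(j)} & \text{otherwise,}
\end{cases}
$$

for $j = 0, 1, \ldots, m-1$.

• $\mathrm{WIN}(\alpha) = \alpha$, for any $\alpha \in \{0,1\}^m$.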


Notice that the size m and the depth d of the acyclic arbitration protocol of BINARY are
equal, specifically $m = d = \lceil \lg n \rceil$. This can be verified by noticing that the computation
for each bus $b_j$, where $0 \le j \le m-1$, takes into account values on busses $b_l$, for $j < l \le
m-1$. This implies, according to Theorem 1, that the asynchronous binary arbitration
scheme takes at most $t = \lceil \lg n \rceil$ stages to arbitrate. On the other hand, it has been
shown in [2, 10, 11, 24, 25, 27] that there are examples where a binary arbitration process
takes exactly $\lceil \lg n \rceil$ stages. These examples consist of arbitrating among bad subsets of
arbitration priorities, where at each stage the binary value of exactly one more bit of
the highest competing binary priority is resolved. Our asynchronous binomial arbitration
scheme, presented next, guarantees fast arbitration by employing certain codewords that
exhibit small data-dependent delays.
The  Asynchronous  Binomial  Arbitration  Scheme 

This scheme uses $m = \lceil \lg n \rceil + 1$ busses to arbitrate among n modules in $t = \lceil \frac{1}{2} \lg n \rceil$
stages. This scheme's acyclic arbitration protocol and interpretation function are identical
to those of the binary arbitration scheme, and thus the same hardware can be used. The
only difference is that binomial codes are used as arbitration priorities rather than all
the $2^m$ possible m-bit codewords of binary arbitration. Alternatively, with m busses, this
scheme can arbitrate among $2^{m-1}$ modules in $t = \lceil \frac{1}{2}(m-1) \rceil$ stages. We next describe the
binomial codes and begin by defining the interval-number of a binary codeword.

Definition 4 The interval-number of a binary codeword p is the number of intervals of
consecutive 1's or 0's that it contains, disregarding leading 0's.

Thus, for example, the interval-number of 001011 is 3, the interval-number of 0000 is
0, and the interval-number of 10101010 is 8. In general, an m-bit binary codeword p with
interval-number r has the form $p = 0^{m_0} 1^{m_1} 0^{m_2} 1^{m_3} \cdots b^{m_r}$, where $b \in \{0, 1\}$; $m_0 \ge 0$;
$m_j > 0$ for $1 \le j \le r$; and $\sum_{j=0}^{r} m_j = m$. We next define the binomial codes of length m.
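Definition 4 can be made concrete in a few lines of Python. This is only an illustrative sketch: the function name `interval_number` and the choice to represent codewords as strings of '0' and '1' are assumptions made here, not part of the scheme.

```python
def interval_number(codeword: str) -> int:
    """Count maximal runs of equal bits in a binary string, ignoring leading 0's."""
    stripped = codeword.lstrip("0")
    if not stripped:
        return 0  # the all-zero codeword has no intervals left
    # One run to start with, plus one more at every position where the bit changes.
    return 1 + sum(a != b for a, b in zip(stripped, stripped[1:]))


print(interval_number("001011"))    # 3
print(interval_number("0000"))      # 0
print(interval_number("10101010"))  # 8
```

The three calls reproduce the examples given in the text.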

Definition 5 The set of binomial codes of length m, denoted by D(m), is the set of all
the m-bit binary codewords that have interval-number at most $\lceil \frac{1}{2}(m-1) \rceil$.

The binomial codes of length m are in fact all the m-bit codewords that, after deleting
leading 0's, have at most $\lceil \frac{1}{2}(m-1) \rceil$ intervals of consecutive 1's or 0's. For example, the
binomial codes of length 4 are D(4) = {0000, 0001, 0010, 0011, 0100, 0110, 0111, 1000, 1100,
1110, 1111}, consisting of 11 codewords that have interval-number at most 2. As another
example, the binomial codes that were used in the introduction are D(5) = {00000, 00001,
00010, 00011, 00100, 00110, 00111, 01000, 01100, 01110, 01111, 10000, 11000, 11100, 11110,
11111}, consisting of the 16 codewords of length 5 with interval-number at most 2. For
general values of m, Corollary 3 in Section 4 shows that there are at least $2^{m-1}$ binomial
codes of length m. By taking $m = \lceil \lg n + 1 \rceil$, this translates to at least $2^{\lceil \lg n + 1 \rceil - 1} \ge n$
binomial codes, which means that there are enough arbitration priorities for n modules.
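The sets D(4) and D(5) listed above can be regenerated by brute force. The sketch below filters all m-bit strings by interval-number; the helper names are hypothetical, and the identity $\lceil (m-1)/2 \rceil = \lfloor m/2 \rfloor$ is used to stay in integer arithmetic.

```python
def interval_number(codeword: str) -> int:
    """Number of maximal runs of equal bits, ignoring leading 0's."""
    stripped = codeword.lstrip("0")
    if not stripped:
        return 0
    return 1 + sum(a != b for a, b in zip(stripped, stripped[1:]))


def binomial_codes(m: int) -> list[str]:
    """All m-bit codewords with interval-number at most ceil((m - 1) / 2)."""
    limit = m // 2  # equals ceil((m - 1) / 2) for every integer m >= 1
    words = (format(i, f"0{m}b") for i in range(2 ** m))
    return [w for w in words if interval_number(w) <= limit]


# The bound |D(m)| >= 2^(m-1) quoted from Corollary 3, checked for small m.
for m in range(1, 11):
    assert len(binomial_codes(m)) >= 2 ** (m - 1)

print(binomial_codes(4))       # the 11 codewords of D(4)
print(len(binomial_codes(5)))  # 16
```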

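Formally, we define this scheme BINOMIAL(n, $\lceil \lg n + 1 \rceil$, $\lceil \frac{1}{2} \lg n \rceil$) = (P, F, WIN) as follows.
We use $m = \lceil \lg n + 1 \rceil$ and $t = \lceil \frac{1}{2} \lg n \rceil$ for simplicity of notation.

• P = D(m).

• $F = (f_{m-1}, \ldots, f_1, f_0)$, where

$$
f_j(p, v_{m-1}, \ldots, v_{j+1}) =
\begin{cases}
0 & \text{if } \bigvee_{i=j+1}^{m-1} \left( p^{(i)} = 0 \wedge v_i = 1 \right), \\
p^{(j)} & \text{otherwise},
\end{cases}
$$

for $j = 0, 1, \ldots, m-1$.

• WIN(a) = a, for any $a \in \{0, 1\}^m$.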
It remains to show that the asynchronous binomial arbitration scheme indeed arbitrates
among n modules in at most $t = \lceil \frac{1}{2} \lg n \rceil$ stages. Notice that a standard static analysis
of the arbitration circuitry, as given for example in Theorem 1, does not give the desired
result, since both the size and the depth of the acyclic arbitration protocol F of binomial
arbitration are $m = d = \lceil \lg n + 1 \rceil$. In Section 4, we use a novel dynamic approach of
analyzing the data-dependent delays experienced in arbitration processes, and prove the
correctness of our scheme as a special case of our generalized binomial arbitration scheme.


4  Generalized  Binomial  Arbitration 


In this section we extend the ideas of the asynchronous binomial arbitration scheme of
Section 3 by presenting the generalized binomial arbitration scheme that, with m busses
and in at most t stages, arbitrates among $n = \sum_{i=0}^{t} \binom{m}{i}$ modules. By Stirling's
approximation, the asymptotic bus-time tradeoff of the generalized binomial arbitration scheme
is approximately $m = \frac{1}{e} t n^{1/t}$. This bus-time tradeoff is of great practical interest, enabling
system designers to achieve a desirable balance between amount of hardware and speed.
The performance of the generalized binomial arbitration scheme is based on an analysis of
data-dependent delays.
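The Stirling step can be sketched as follows. The regime $t \ll m$, in which the last term dominates the sum, and the intermediate approximations are assumptions made for this illustration; the text itself only states the result.

$$
n = \sum_{i=0}^{t} \binom{m}{i} \;\approx\; \binom{m}{t} \;\approx\; \frac{m^t}{t!} \;\approx\; \frac{1}{\sqrt{2\pi t}} \left( \frac{e\,m}{t} \right)^{t}
\quad\Longrightarrow\quad
m \;\approx\; \frac{t}{e} \left( \sqrt{2\pi t}\; n \right)^{1/t} .
$$

The correction factor $(2\pi t)^{1/(2t)}$ tends to 1 as t grows, which leaves the stated form $m \approx \frac{1}{e} t n^{1/t}$.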

We  first  define  the  set  of  generalized  binomial  codes  of  length  m  and  diversity  r. 

Definition 6 The set of generalized binomial codes of length m and diversity r, denoted
by G(m, r), is the set of all m-bit binary codewords that have interval-number at most r.

Generalized  binomial  codes  serve  as  arbitration  priorities  in  the  generalized  binomial 
arbitration  scheme.  The  next  lemma  determines  the  cardinality  of  the  set  of  the  generalized 
binomial  codes  of  length  m  and  diversity  r. 

Lemma 2 The set G(m, r) contains $\sum_{i=0}^{r} \binom{m}{i}$ distinct codewords.

Proof. To simplify the counting, we take all the codewords in G(m, r) and prepend a 0 to
each of them. This results in a set of (m + 1)-bit words that begin with a 0 and have at
most r switching points from a consecutive interval of 0's to a consecutive interval of 1's
and vice versa. The number of such words is $\sum_{i=0}^{r} \binom{m}{i}$, since there are exactly that many
possibilities of choosing at most r switching points out of m possible positions. ∎
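The counting argument in the proof is constructive, so it can be checked mechanically. The sketch below builds G(m, r) by choosing at most r switching points among the m positions and compares the result with a brute-force filter; all function names are hypothetical.

```python
from itertools import combinations
from math import comb


def interval_number(codeword: str) -> int:
    """Number of maximal runs of equal bits, ignoring leading 0's."""
    stripped = codeword.lstrip("0")
    if not stripped:
        return 0
    return 1 + sum(a != b for a, b in zip(stripped, stripped[1:]))


def codes_from_switch_points(m: int, r: int) -> set[str]:
    """Build G(m, r) as in the proof: pick at most r switching points among m positions."""
    codes = set()
    for count in range(r + 1):
        for switches in combinations(range(m), count):
            bit, word = 0, []          # the prepended 0 fixes the starting value
            for pos in range(m):
                if pos in switches:    # a switch just before position pos flips the bit
                    bit ^= 1
                word.append(str(bit))
            codes.add("".join(word))
    return codes


for m in range(1, 9):
    all_words = [format(i, f"0{m}b") for i in range(2 ** m)]
    for r in range(m + 1):
        brute = {w for w in all_words if interval_number(w) <= r}
        built = codes_from_switch_points(m, r)
        assert built == brute
        assert len(built) == sum(comb(m, i) for i in range(r + 1))
```

Each choice of switching points yields a distinct codeword whose interval-number equals the number of points chosen, which is why the two sets agree.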


Corollary 3 There are at least $2^{m-1}$ binomial codes of length m.


Proof. By our notation, the set of binomial codes of length m, D(m), is defined by
$D(m) = G(m, \lceil \frac{1}{2}(m-1) \rceil)$. According to Lemma 2, we have

$$
|D(m)| = \sum_{i=0}^{\lceil \frac{1}{2}(m-1) \rceil} \binom{m}{i} .
$$

The sum includes the first $\lceil \frac{1}{2}(m-1) \rceil + 1$ binomial coefficients, which constitute at least
a half of all the m + 1 binomial coefficients. Since the binomial coefficients are symmetric,
$\binom{m}{i} = \binom{m}{m-i}$, the partial sum is therefore at least a half of the full sum, which is $2^m$.
We therefore conclude that $|D(m)| \ge \frac{1}{2} \cdot 2^m = 2^{m-1}$. ∎

The  Asynchronous  Generalized  Binomial  Arbitration  Scheme 

This scheme uses m busses and arbitrates in at most t stages, for $t \le m$. With the m
and t parameters determined, this scheme can arbitrate among at most $n = \sum_{i=0}^{t} \binom{m}{i}$
modules. The acyclic arbitration protocol and the interpretation function of this scheme
are identical to those of the binary arbitration scheme of Section 3, and thus the same
hardware can be used. The only difference is that generalized binomial codes from G(m, t)
are used as arbitration priorities.

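Formally, we define this scheme GENERALIZED-BINOMIAL(n, m, t) = (P, F, WIN), for
$n = \sum_{i=0}^{t} \binom{m}{i}$, as follows.

• P = G(m, t).

• $F = (f_{m-1}, \ldots, f_1, f_0)$, where

$$
f_j(p, v_{m-1}, \ldots, v_{j+1}) =
\begin{cases}
0 & \text{if } \bigvee_{i=j+1}^{m-1} \left( p^{(i)} = 0 \wedge v_i = 1 \right), \\
p^{(j)} & \text{otherwise},
\end{cases}
$$

for $j = 0, 1, \ldots, m-1$.

• WIN(a) = a, for $a \in \{0, 1\}^m$.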

The  idea  behind  generalized  binomial  arbitration  is  that  the  interval-number  of  the 
highest  competing  arbitration  priority  bounds  the  number  of  arbitration  stages.  In  binary 
arbitration,  where  all  the  2m  m-bit  codewords  are  used,  arbitration  processes  can  take  as 
many  as  m  stages,  where  at  each  stage  one  more  bit  of  the  highest  competing  arbitration 
priority is resolved. For generalized binomial arbitration, however, we select codewords
that have at most t intervals of consecutive 1's or 0's. The following theorem uses
data-dependent analysis to argue that any arbitration process takes at most r stages, where r
is the interval-number of the highest competing arbitration priority, by showing that at
each stage the arbitration process resolves at least one more interval of consecutive bits.


Theorem 4 Consider a generalized binomial arbitration process on m busses. Let Q be
the set of competing arbitration priorities, p be the highest arbitration priority in Q, and
r be the interval-number of p. Then after s stages, for any $s \ge r$, bus $b_j$ carries the logic
value $p^{(j)}$, for $0 \le j \le m-1$.

Proof. We prove the theorem by induction on r for arbitrary values of m. We use the
notation $v_j[k]$ to denote the logic value on bus $b_j$ at the end of stage k, for $j = 0, 1, \ldots, m-1$
and $k = 0, 1, \ldots$
Base case: r = 0. The codeword p consists of m consecutive 0's, that is, $p^{(j)} = 0$ for
$j = 0, 1, \ldots, m-1$. Since p is the highest arbitration priority in Q, any $q \in Q$ must
also have $q^{(j)} = 0$ for $j = 0, 1, \ldots, m-1$. By our assumption that all the m busses are
initially in logic value 0, and since according to the acyclic arbitration protocol no module
ever applies a 1 to any of these busses, the m busses remain in logic value 0 forever. In
other words, after s stages, for any $s \ge r = 0$, we have $v_j[s] = v_j[0] = 0 = p^{(j)}$, for
$j = 0, 1, \ldots, m-1$, which proves the claim.

Inductive case: r > 0. The codeword p has m bits and interval-number r, and is thus
of the form $p = 0^{m_0} 1^{m_1} 0^{m_2} 1^{m_3} \cdots b^{m_r}$, where $b \in \{0, 1\}$; $m_0 \ge 0$; $m_j > 0$ for $1 \le j \le r$;
and $\sum_{j=0}^{r} m_j = m$. We first concentrate on the first r - 1 intervals of p, and define the
set R of reduced codewords of length $\bar{m} = m - m_r = \sum_{j=0}^{r-1} m_j$ by ignoring the last $m_r$
bits of the codewords of Q. It is easy to verify that $\bar{p}$, the reduced version of p, is the
highest codeword in R, because we discarded the $m_r$ least significant bits of codewords in
Q. Furthermore, the interval-number of $\bar{p}$ is r - 1, since the last interval of p of the form
$b^{m_r}$ was ignored. By applying the claim inductively with $\bar{m}$ busses, the set of competing
arbitration priorities R, and the highest arbitration priority $\bar{p}$ of interval-number r - 1, we
find that after r - 1 stages the most significant $\bar{m} = m - m_r$ busses stabilize to the bits of
$\bar{p}$. That is, for any $k \ge r-1$, we have $v_j[k] = v_j[r-1] = \bar{p}^{(j - m_r)} = p^{(j)}$, for $m_r \le j \le m-1$.
We now consider the last $m_r$ busses, $b_{m_r - 1}, \ldots, b_1, b_0$. There are two cases to consider:

b = 1: The rth interval of p is an interval of $m_r$ consecutive 1's, that is, $p^{(i)} = 1$ for
$i = 0, 1, \ldots, m_r - 1$. After k stages, for any $k \ge r-1$, the most significant $m - m_r$ busses
carry the bits of $\bar{p}$, and therefore there is no l in the range $0 \le l \le m-1$ with
$v_l[k] = 1$ and $p^{(l)} = 0$. As a result, the module with arbitration priority p applies
all its last $m_r$ consecutive 1's. Therefore, for any $s \ge r$ and $i = 0, 1, \ldots, m_r - 1$, we
have $v_i[s] = v_i[r] = 1 = p^{(i)}$, since the busses implement a wired-OR in one stage.

b = 0: The rth interval of p is an interval of $m_r$ consecutive 0's, that is, $p^{(i)} = 0$ for
$i = 0, 1, \ldots, m_r - 1$. Since p is the highest arbitration priority in Q, for any
arbitration priority $q \in Q$, $q \ne p$, there must exist an l in the range $m_r \le l \le m-1$ with
$p^{(l)} = 1$ and $q^{(l)} = 0$. After k stages, for any $k \ge r-1$, the most significant $m - m_r$
busses carry the bits of $\bar{p}$, and therefore any module with arbitration priority $q \ne p$
disables at least its last $m_r$ bits. As a result, for any $s \ge r$ and $i = 0, 1, \ldots, m_r - 1$,
we have $v_i[s] = v_i[r] = 0 = p^{(i)}$, because the busses implement a wired-OR in one
stage and no module applies a 1 to busses $b_0$ through $b_{m_r - 1}$ anymore.

Thus, after s stages, for $s \ge r$, the m busses carry the corresponding bits of p. ∎

The  following  corollary  shows  that  by  taking  G(m,t),  the  generalized  binomial  codes 
of  length  m  and  diversity  t,  as  arbitration  priorities,  we  guarantee  that  any  arbitration 
process  completes  in  at  most  t  stages. 

Corollary 5 Consider GENERALIZED-BINOMIAL(n, m, t), the generalized binomial
arbitration scheme. For any subset of arbitration priorities $Q \subseteq G(m, t)$, the corresponding
arbitration process takes at most t stages.

Proof. Let p be the highest arbitration priority in Q. Since the interval-number of p is
at most t, Theorem 4 guarantees that the arbitration process on Q, with p as the highest
arbitration priority, takes no more than t stages. ∎
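Theorem 4 and Corollary 5 can be exercised with a small stage-by-stage simulation. The sketch below assumes a discrete model in which every module recomputes its outputs from the bus values at the end of the previous stage, all busses start at 0, and each bus carries the wired-OR of what is applied to it. Priorities are held as integers with bit j corresponding to bus $b_j$; the function names are hypothetical.

```python
def interval_number(p: int) -> int:
    """Number of maximal runs of equal bits in p, ignoring leading 0's."""
    if p == 0:
        return 0
    bits = bin(p)[2:]  # binary string without leading zeros
    return 1 + sum(a != b for a, b in zip(bits, bits[1:]))


def stage(priorities, bus: int, m: int) -> int:
    """One arbitration stage: every module drives its enabled bits, busses wire-OR them."""
    new_bus = 0
    for p in priorities:
        for j in range(m):
            # f_j yields 0 if some higher bus i carries a 1 where p has a 0.
            disabled = any((bus >> i) & 1 and not (p >> i) & 1
                           for i in range(j + 1, m))
            if not disabled and (p >> j) & 1:
                new_bus |= 1 << j
    return new_bus


def stages_to_settle(priorities, m: int) -> int:
    """Smallest s such that the busses carry max(priorities) from stage s onward."""
    history = [0]  # v[0]: all busses at logic value 0
    for _ in range(m + 1):
        history.append(stage(priorities, history[-1], m))
    winner = max(priorities)
    s = len(history) - 1
    assert history[s] == winner
    while s > 0 and history[s - 1] == winner:
        s -= 1
    return s


print(stages_to_settle([0b101, 0b010], m=3))         # 3: a worst case for binary codes
print(stages_to_settle([0b110, 0b011, 0b001], m=3))  # 2: the winner 110 has interval-number 2
```

In both runs the number of stages equals the interval-number of the highest competing priority, which is the bound stated by Theorem 4.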


The  Generalized  Binomial  Arbitration  Tradeoff 

The generalized binomial arbitration scheme achieves a bus-time tradeoff of the form
$n = \sum_{i=0}^{t} \binom{m}{i}$, which by Stirling's formula exhibits asymptotic behavior $m \approx \frac{1}{e} t n^{1/t}$. Figure 3
presents this bus-time tradeoff for a system consisting of n = 64 modules. The number of
busses varies from lg n = 6 to n = 64, and the arbitration time is in the range 1 to lg n = 6
stages. Generalized binomial arbitration reduces to binary arbitration with $m = \lceil \lg n \rceil = 6$
busses, to binomial arbitration with $m = \lceil \lg n + 1 \rceil = 7$ busses, and to a modified version
of linear arbitration (see Section 5) with m = n = 64 busses.
Figure 3: Bus-time tradeoff of the generalized binomial arbitration scheme for n = 64 modules,
using $6 \le m \le 64$ busses and $1 \le t \le 6$ stages.
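The tradeoff for n = 64 can be tabulated directly from the formula $n = \sum_{i=0}^{t} \binom{m}{i}$. The sketch below searches for the smallest m that accommodates n modules in t stages; it is computed from the formula alone and is not a transcription of Figure 3. The function names are hypothetical.

```python
from math import comb


def capacity(m: int, t: int) -> int:
    """Number of codewords in G(m, t): the sum of C(m, i) for i = 0..t."""
    return sum(comb(m, i) for i in range(t + 1))


def min_busses(n: int, t: int) -> int:
    """Smallest m >= t whose capacity with t stages is at least n."""
    m = t  # the scheme requires t <= m
    while capacity(m, t) < n:
        m += 1
    return m


n = 64
for t in range(1, 7):
    print(t, min_busses(n, t))
# prints: 1 63, 2 11, 3 7, 4 7, 5 7, 6 6 (one pair per line)
```

The value 63 for t = 1 is one less than the 64 busses of modified linear arbitration because the formula also counts the all-zero codeword. The value 11 for t = 2 is close to $\sqrt{2n} \approx 11.3$.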

Figure 3 demonstrates that neither linear arbitration nor binary arbitration efficiently
utilizes the resources. For example, increasing the number of busses used in binary
arbitration by one results in speeding up the arbitration process by a factor of 2, as exhibited
by our binomial arbitration scheme. On the other hand, allowing another time unit for
linear arbitration enables reducing the number of busses from n to approximately $\sqrt{2n}$.

Notice, however, that in order to achieve another factor-of-2 improvement in the
arbitration time, adding another constant number of busses to the lg n busses is not enough.
Asymptotically, as n grows without bound, we need to use more than $(1 + \varepsilon) \lg n$ busses,
for $\varepsilon > 0.232$, in order for the sum $\sum_{i=0}^{t} \binom{m}{i}$, with $t = \frac{1}{4} \lg n$, to be at least n. This
can be verified by Stirling's formula, since when m is greater than lg n but smaller than
1.232 lg n, and when $t = \frac{1}{4} \lg n < m/4$, the sum of the first m/4 binomial coefficients $\binom{m}{i}$,
for $0 \le i \le m/4$, does not exceed n. This demonstrates that our binomial arbitration
scheme, which uses lg n + 1 busses, exhibits a most economic balance, much more so than
the binary arbitration scheme. Other authors [11] have also discovered that by excluding
certain codewords, the arbitration time of binary arbitration can be reduced. We, however,
give the first general scheme that provides a full spectrum of bus-time tradeoff.
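The constant 0.232 can be recovered from the standard entropy bound on partial binomial sums, $\sum_{i \le \alpha m} \binom{m}{i} \le 2^{H(\alpha) m}$ for $0 < \alpha \le \frac{1}{2}$, where $H$ is the binary entropy function. The text appeals to Stirling's formula; phrasing the step through the entropy bound is a choice made for this sketch.

$$
H\!\left(\tfrac{1}{4}\right) = \tfrac{1}{4} \lg 4 + \tfrac{3}{4} \lg \tfrac{4}{3} = 2 - \tfrac{3}{4} \lg 3 \approx 0.8113 ,
$$

$$
t = \tfrac{1}{4} \lg n < \tfrac{m}{4}
\;\Longrightarrow\;
\sum_{i=0}^{t} \binom{m}{i} \;\le\; \sum_{i=0}^{\lfloor m/4 \rfloor} \binom{m}{i} \;\le\; 2^{0.8113\, m} ,
$$

$$
2^{0.8113\, m} < n = 2^{\lg n}
\iff
m < \frac{\lg n}{0.8113} \approx 1.2326 \lg n .
$$

Hence for $\lg n < m < 1.232 \lg n$ the sum stays below n, so at least $(1 + \varepsilon) \lg n$ busses with $\varepsilon > 0.232$ are necessary.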


5  Extensions 

This section contains some discussion, additional results, and directions of research
concerning priority arbitration with busses.


Bus  Propagation  Delay,  Settling  Time,  and  Wired-OR  Glitch 

High-speed busses are commonly modeled as electrical transmission lines, where it takes
some finite amount of time for a signal to propagate through the bus and bring the bus to
a stable logic value. In addition, there are the response time of logic gates and the effect of
the wired-OR glitch that need to be considered. In particular, the effect of the wired-OR
glitch on bus-settling time and the use of special integration logic at module receivers to
reduce this effect (see [3, 8, 16, 25]) seem to support our model.

Some authors carry out a more elaborate analysis of high speed busses (see [2, 8,
23, 24, 25]), which takes into account the distances between modules on the bus and
imposes certain assumptions on the arbitration priorities. In [24, 25], for example, Taub
assumes geographical ordering of module priorities and equal distances between modules
on a backplane bus. Counterexamples to Taub's analysis, where these requirements are
not met, have been found [2, 27]. Our model, on the other hand, is applicable to a wider
class of systems, such as data communication broadcast channels and bus systems where
priorities and module locations are not predetermined and fixed.

The Asynchronous k-ary Arbitration Scheme

The linear arbitration and binary arbitration schemes of Section 3 use n-ary and binary
representations, respectively, of module priorities. We can also use radix-k representation
of module priorities, for other values of k, to arbitrate among $n = k^t$ modules in t units
of time, using m = tk busses. We sketch the asynchronous k-ary arbitration scheme here
due to its simplicity and because it generalizes the linear and binary arbitration schemes
rather straightforwardly. This scheme exhibits a bus-time tradeoff of the form $m = t n^{1/t}$,
which is a factor of e worse than our generalized binomial arbitration scheme.

Asynchronous k-ary arbitration, for $2 \le k \le n$, can be described as follows. Each
module is assigned a unique k-ary arbitration priority consisting of t radix-k digits. We
divide the m = tk busses into t disjoint groups, each consisting of k busses. During
arbitration, competing module c applies the t radix-k digits of its arbitration priority p to
the t groups of busses, using linear encoding of its digits on each group of k busses. As
arbitration progresses, competing module c monitors the t groups of busses and disables
its drivers according to the following rule: let $p^{(l)}$ be the lth radix-k digit of p and $d_l$ be the
highest index of a bus in the lth group of busses that carries a 1. Then if $p^{(l)} < d_l$, module
c disables all its digits $p^{(j)}$ for j < l. Disabled digits are re-enabled should the condition
cease to hold. Arbitration proceeds in t stages, each of which consists of resolving the
value of another radix-k digit of the highest competing k-ary arbitration priority.
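The disabling rule above can be simulated stage by stage. The sketch below makes two assumptions that the text does not spell out: "linear encoding" is taken to mean that a digit with value d drives bus d of its group, and all modules update once per stage from the bus values of the previous stage. Digits are stored most significant first, and the function name is hypothetical.

```python
def kary_arbitrate(priorities, t: int):
    """Simulate staged k-ary arbitration.

    priorities: tuples of t radix-k digits, most significant digit first.
    Returns (resolved digits, number of stages until the busses stop changing).
    """
    highest = [-1] * t  # per group: highest bus index carrying a 1 (-1 means none)
    stages = 0
    while True:
        new_highest = [-1] * t
        for p in priorities:
            for group in range(t):
                # The module drives bus p[group] of this group.
                new_highest[group] = max(new_highest[group], p[group])
                # A larger digit seen on this group in the previous stage
                # disables all of this module's less significant digits.
                if p[group] < highest[group]:
                    break
        if new_highest == highest:
            return tuple(highest), stages
        highest, stages = new_highest, stages + 1


# k = 4, t = 3: the highest priority wins after three stages.
print(kary_arbitrate([(3, 0, 2), (3, 1, 0), (2, 3, 3)], t=3))  # ((3, 1, 0), 3)
```

Each stage fixes one more digit of the winner, starting from the most significant group, so the process settles within t stages under these assumptions.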


Modified  Linear  Arbitration 

A modified version of linear arbitration, which uses the same acyclic arbitration protocol as
binary arbitration, achieves the same bus-time tradeoff as linear arbitration. This version
is the generalized binomial arbitration scheme with m = n busses and t = 1 time, where the
arbitration priority of module $c_i$ is $p_i = 0^{n-i-1} 1^{i+1}$, for $i = 0, 1, \ldots, n-1$. This observation
poses an interesting question regarding the universality of the acyclic arbitration protocol
of binary arbitration.
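The codewords $p_i = 0^{n-i-1} 1^{i+1}$ all have interval-number 1, so a single wired-OR stage suffices. The sketch below illustrates this under the assumption that all busses start at 0, so that no module is disabled during the first stage; the function names are hypothetical.

```python
from functools import reduce


def modified_linear_priority(i: int, n: int) -> int:
    """Codeword 0^(n-i-1) 1^(i+1) as an n-bit integer, for module i with 0 <= i < n."""
    assert 0 <= i < n
    return (1 << (i + 1)) - 1


def arbitrate_one_stage(competitors, n: int) -> int:
    """Bus value after one stage: the wired-OR of all applied codewords."""
    codes = (modified_linear_priority(i, n) for i in competitors)
    return reduce(lambda a, b: a | b, codes, 0)


n = 8
result = arbitrate_one_stage([1, 4, 2], n)
print(format(result, f"0{n}b"))                   # 00011111
print(result == modified_linear_priority(4, n))   # True: module 4 wins
```

Because the codewords are nested runs of 1's, their OR is simply the longest run, which is the priority of the highest competing module.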


Lower  Bound  for  Asynchronous  Priority  Arbitration 

The asynchronous generalized binomial arbitration scheme achieves a bus-time tradeoff of
the form $n = \sum_{i=0}^{t} \binom{m}{i}$, where n is the number of modules, m is the number of busses, and
t is the arbitration time. We conjecture that this tradeoff is optimal for our asynchronous
priority arbitration model, in that no more than $n = \sum_{i=0}^{t} \binom{m}{i}$ modules can be
arbitrated with m busses in at most t stages.


Synchronous  Priority  Arbitration  Schemes 

In this paper we discussed the asynchronous model of priority arbitration with busses and
presented several asynchronous schemes. Considering synchronous priority arbitration
schemes that use clocked arbitration logic, we can show that a synchronous version of k-ary
arbitration achieves a bus-time tradeoff of the form $m = n^{1/t}$, and that this tradeoff is
optimal in a related synchronous model of arbitration. We can also demonstrate how to
combine asynchronous combinational schemes with synchronous clocked schemes to achieve
a wide spectrum of bus-time tradeoff.


Resource  Tradeoffs 

Resource tradeoffs of the form $m = \Theta(t n^{1/t})$, based on multiway trees and the special class
of binomial trees, are discussed in [4] for a variety of problems such as parallel sorting
algorithms, searching algorithms, and VLSI layouts. Asynchronous priority arbitration
with busses can in fact be considered as a selection process on trees. Asynchronous k-ary
arbitration corresponds to a selection process on regular trees of branching factor k, while
asynchronous generalized binomial arbitration corresponds to a selection process on the
more economical "modified binomial trees" of [4].
Technologies for Low Latency Interconnection Switches


Thomas  F.  Knight,  Jr. 

M.I.T.  Artificial  Intelligence  Laboratory 

Abstract 

This paper presents an engineering design for a low latency high bandwidth interconnection
network which will form the switching substrate for a multi-model parallel processing
system. The performance is enhanced with a variety of approaches covering interconnection
protocols, routing, fault tolerance, advanced packaging, and electrical interconnection
techniques. The synergistic application of these technologies leads to a high
performance design.


Motivation 

A key performance factor in large scale parallel computer
systems is the latency in processor communications.
[Dertouzos 88] considers a program with available parallelism
p, running on a multiprocessor of size n, with a
communications latency l, measured in terms of instruction
execution times. He establishes that there is a speedup linear in
n if nl << p, but that this speedup approaches an
asymptotic bound of p/l when nl >> p.

Our parallel programming model and algorithm design can
influence the available parallelism, or the average length of
independently scheduled instruction sequences [1], but the
latency of the communication network remains one of the
fundamental characteristics of the hardware architecture.

In message passing models, the interprocessor communication
latency appears as a delay in receiving messages from
remote processors [Dally 88]. In shared memory systems
[Butterfly 87, Pfister 85] the latency of the communication
network affects the average memory reference time. Even the
addition of shared memory caches [SCI, Agarwal 88, 3;s.ar]
to large scale parallel shared memory systems simply
moves this latency from occurring once every memory
cycle to once every cache miss time. Even in SIMD
architectures such as the connection machine [Hillis 85], the long
latency for communications is a significant bottleneck,
resulting in programmers avoiding its use when possible.

[1] In the presence of code blocks of length q which can be
executed independently without interprocessor communication,
Dertouzos shows that the relationship is modified by
substituting pq for the available parallelism p, making the speedup
less dependent on the latency.

Several recent architectures supporting particular programming
styles drastically lower the latency of communications
to achieve higher performance. The Ametek hypercube
architecture [Ametek 86], for example, achieves microsecond
latencies for interprocessor communication as compared
to the hundreds of microseconds for first generation
hypercube processors such as the original Caltech
design [Seitz 85]. Similarly, the Masspar architecture
dramatically reduces the latency for large scale SIMD
communications [Crordilsb 87] compared to the connection machine.

Alewife

At MIT, Anant Agarwal and I are designing an architecture
called Alewife which has as an explicit goal the support of
a wide variety of parallel programming models. As such, it
provides hardware support for a variety of programming
styles, including several types of shared memory, message
passing, and data level parallelism. Achieving this broad
range of model support requires an extremely low latency
communications mechanism. We are using this detailed
design as a test-bed for the broader problem of designing
extremely large, scalable parallel machines, which are
flexible in their programming style.

The Alewife design consists of three major components.
The first component is a simple processor characterized by
fast context switching, fast message dispatching, and
support for data typing. The second is a cache and
interprocessor communications controller capable of
supporting coherent memory access in the absence of a
single shared bus. Finally, the design relies on a fast,
efficient communication network, called Transit.

The modularity of this design provides an opportunity to
reuse portions of the machine as a substrate for other
architectures. In particular, we are carefully defining the
interface between each of the components of the
architecture to allow one portion to be replaced by different
or higher performance equivalents. Transit supports a
carefully defined interface to the cache controller, and the
cache controller presents both a uniform shared memory
model and an explicit processor to processor communication
model to the processors.

Transit  Target  Specifications 

The Transit network provides uniform communications
between 256 processor/memory clusters. Latency for a
remote memory reference is 280 nanoseconds, and peak
bandwidth is 100 megabytes/second/port. The remainder
of this paper concerns the technology with which this
network is constructed, and the impact these techniques
have on lowering the latency of communications. We will
briefly consider more advanced interconnection techniques
and address the issue of scaling the design to larger
numbers of processors. Because of space limitations, most
of the discussion will consist of a description of the
techniques Transit uses to achieve high performance, with
little discussion of alternative possible designs. In many
cases viable alternatives exist, but the space of possible
designs is so large that it is impractical in a short paper to
discuss alternatives for every decision.

Communication  Protocols 

Transit uses a connection based source-responsible routing
protocol. The sending controller transmits a routing header
and optional data forward into the Transit network, while
retaining a copy of the message. The network makes a best
effort to establish a communication link between the source
and destination port. After transmitting the forward
message, the communications path which has been
established is electrically reversed, and an acknowledgment
and optional data flows from the recipient to the sender.
If, for any reason, the attempted communication fails, it
is the responsibility of the sender to retry the connection.
The ability of the sender to retry failed communications
leads to important simplifications in the routing element
used in the communication switch, since we need not
buffer or flow control the messages being sent to an
element. Instead, the element, if it is congested, is free to
discard awkwardly timed messages. Similarly, failures of
routing elements or the wiring between them can be
handled by simply detecting the failed communication
attempt using checksumming techniques, and resending
damaged data. The total failure of routing elements or
interconnect is handled by redundant, randomized routing
described below. Explicit acknowledgment might seem to
slow the network, but is required eventually even in
networks which accept responsibility for delivering
messages. Here, the reply data from a memory request, for
example, can be combined with the acknowledgment.

Each port of the Transit network consists of a nine bit wide
path, synchronously clocked every 10 nanoseconds. One
bit is a framing bit, used to distinguish control bytes from
data bytes, and the remaining eight bits are used to transmit
one byte of routing information, or data. Figure 1 shows
the details of the inter-chip timing of a simple transfer.

In the idle state, the sender transmits a zero framing bit
each clock cycle. At the start of a message, one byte of
routing data and a framing bit of one are sent into the input
port. Each clock cycle thereafter, a data byte is transmitted
into the input port. This forward stream of bytes is
pipelined through each stage of the interconnection
network, and eventually reaches the destination.

When all of the sender data has been transmitted, a
distinguished byte, the turn byte (all ones with a zero
framing bit), is transmitted. This is a signal to reverse the
data flow in the network. On receipt of the turn byte, each
stage of the network starts pipelining data back to the
original sender. When the turn byte reaches the
destination, a complete reverse path has been set up
allowing data to flow from destination to the sender. The
destination transmits an acknowledgment, followed by any
number of data bytes. The framing bit in the reverse
direction is used to signal the completion of data transfer.
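The forward half of this exchange can be sketched as a sequence of nine-bit words. The text fixes the framing bit only for the idle state (zero), the routing byte (one), and the turn byte (zero, with all data bits one); treating ordinary data bytes as carrying a framing bit of one, and placing the framing bit above the eight data bits, are assumptions of this illustration. All names are hypothetical.

```python
CLOCK_NS = 10      # one nine-bit word is clocked every 10 nanoseconds
TURN = (0, 0xFF)   # turn byte: all ones with a zero framing bit


def forward_words(routing_byte: int, data: bytes) -> list[tuple[int, int]]:
    """(framing_bit, byte) pairs a sender clocks into its port for one forward message."""
    words = [(1, routing_byte)]        # start of message: routing byte, framing bit 1
    words += [(1, b) for b in data]    # assumption: data bytes keep the framing bit at 1
    words.append(TURN)                 # request reversal of the connection
    return words


def pack(framing: int, byte: int) -> int:
    """Pack one word onto the nine-bit path (framing bit on top is an assumption)."""
    return (framing << 8) | byte


words = forward_words(0b11100100, b"\x12\x34")
print([format(pack(f, b), "09b") for f, b in words])
print(len(words) * CLOCK_NS, "ns to clock the forward message into the port")
```

At one byte per 10 nanosecond clock, the eight data bits account for the 100 megabytes/second/port peak bandwidth quoted in the target specifications.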

Status information is available to the sender as a side effect
of this sequence. Because of pipelining, a D stage switch
has 2D clock periods of delay following the sender's
transmission of the turn byte and prior to the arrival of the
acknowledgment. During this period, each stage of the
interconnection switch transmits a pair of status bytes back
to the sender, indicating which (if any) of the output ports
were assigned to the connection leaving this switch stage,
and a checksum for the message at this stage of the switch.
The status information is used by the sender to determine
the exact path through the switch this message was routed
with, to determine where a message was garbled in transit,
and to determine at which switch stage a message was
discarded, if it was thrown away.
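
Two status bytes per stage appear to fill the 2D clock periods
of turnaround exactly. A minimal sketch of how a sender might
group them follows; the ordering of stages in the returned
stream and the layout of each pair (port-assignment byte first,
checksum byte second) are assumptions, since the text does not
specify them, and the function names are hypothetical.

```python
# Sketch: group the status bytes returned during the turnaround period.
# Assumed (not stated in the text): stages report one after another, and
# each pair is (port-assignment byte, checksum byte).


def turnaround_clocks(stages):
    """Clock periods between sending the turn byte and seeing the acknowledgment."""
    return 2 * stages


def split_status(status_bytes, stages):
    """Split the status bytes into one (assignment, checksum) pair per stage."""
    if len(status_bytes) != turnaround_clocks(stages):
        raise ValueError("expected two status bytes per stage")
    return [(status_bytes[2 * i], status_bytes[2 * i + 1]) for i in range(stages)]
```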

Interconnection  Topology 

The Transit network consists of a four stage, radix four
omega network, providing 256 possible destinations. Each
routing element is an eight input port, eight output port
switching component. The eight input ports are
interchangeable, and the eight output ports are paired in
four groups of two ports each. An input message is routed
to one of the two available output ports in the direction
specified by two bits of the routing byte. Once this routing
is performed, the path which is set up will remain assigned
until the connection is dropped. If neither of the two
output ports in the desired output direction is available, the
message is discarded.

The  wiring  of  the  port  from  each  stage  of  the  switch  to  the 
next  is  arranged  so  that  the  data  wires  are  rotated  by  two 
bits.  This  permutation  of  the  data  wires  allows  the  two  bit 
field  of  the  routing  byte  seen  by  each  of  the  four  stages  of 
the  switch  to  differ,  routing  the  message  on  all  eight  bits. 
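
The effect of the two-bit rotation can be sketched as follows.
Which two bits an element examines and the direction of the
rotation are not stated in the text; the sketch assumes every
element reads the top two bits and the wiring rotates the byte
left. The names are hypothetical.

```python
# Sketch: a two-bit rotation between stages exposes a new routing field
# to each stage. Assumed: elements read the top two bits, wiring rotates left.

STAGES = 4
BITS_PER_STAGE = 2


def rotate_left(byte, n):
    """Rotate an 8-bit value left by n bits."""
    n %= 8
    return ((byte << n) | (byte >> (8 - n))) & 0xFF


def directions(routing_byte):
    """Return the two-bit direction seen by each of the four stages."""
    seen = []
    for _ in range(STAGES):
        seen.append(routing_byte >> 6)      # top two bits select 1 of 4 directions
        routing_byte = rotate_left(routing_byte, BITS_PER_STAGE)
    return seen
```

Under these assumptions each of the 256 routing byte values
yields a distinct sequence of four directions, which is the
sense in which the message is routed on all eight bits.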

The pairing up of output ports in the routing element
provides an important fault tolerance feature of the design.
If both output ports in a given direction are available when
a message is to be routed, a pseudorandom number
generator is used to arbitrarily choose between them. This
assures that the path taken through the switch on an attempt
to retry transmitting a message after failure will, with high
probability, take a different path than the first try. This
path redundancy allows fault tolerance to be built into the
network at very low overhead. Ideally, the two output
ports which go in logically identical directions should be
wired to physically distinct routing elements to provide
better fault coverage. This is possible in all but the final
stage of the switch, where all messages destined for a
particular processor must flow through one routing
element. The necessity to wire this final stage differently
is in conflict with the desire to wire all stages with the
same permutation, for reasons which we describe below in
the section on packaging.
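
The port selection rule for a single routing element can be
sketched as below. This is a behavioral illustration, not the
hardware design: the class and method names are hypothetical,
and Python's random.Random stands in for the on-chip
pseudorandom number generator.

```python
import random

# Sketch of output-port allocation in one routing element: four logical
# directions, two physical ports per direction.


class RoutingElement:
    def __init__(self, seed=None):
        self.busy = [[False, False] for _ in range(4)]   # busy[direction][port]
        self.rng = random.Random(seed)

    def connect(self, direction):
        """Claim a free port in `direction`; return its index, or None to discard."""
        free = [p for p in (0, 1) if not self.busy[direction][p]]
        if not free:
            return None                    # both ports taken: message is discarded
        port = self.rng.choice(free)       # arbitrary choice when both are free
        self.busy[direction][port] = True  # path stays assigned until dropped
        return port

    def drop(self, direction, port):
        """Release a port when its connection is dropped."""
        self.busy[direction][port] = False
```

A third request for a direction whose two ports are both busy
returns None, corresponding to a discarded message that the
sender must retry.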

The  choice  of  four  pairs  of  output  ports  as  a  routing 
element  design  also  has  important  implications  for  the 
statistical  success  of  the  routing  process.  This  issue  is 
discussed in detail in the section below on performance.

Figure 2


Packaging  Issues 

The packaging of high performance systems has an
extreme impact on their speed, to the extent that system
level design is often dictated by available packaging
technology. The Transit network is packaged using a
unique three-dimensional wiring technology which allows
roughly equivalent wiring density in all three dimensions.
The approach consists of using conventional printed circuit
boards, with a 50 ohm controlled impedance stripline
structure, for two of the three dimensions. For the third
dimension of wiring, these boards are layered on top of one
another, as shown in figure 2.

Contact between the boards is provided with button boards
[Smolley], a term describing compliant fine wire fuzz
buttons pushed into blank, drilled printed circuit board
material (figure 3). These buttons, formed by compressing
25 micron wire into a cylindrical die 20 mils in diameter by
40 mils high, are used on staggered, 50 mil centers, to
provide extremely dense connectors between layers of the
packaging. Because of the short distances involved,
impedance mismatch is minimal if care is taken with
ground wire density.

Components are packaged into this structure by mounting
them on carriers also fabricated from standard PC board
materials. A recessed cavity is used to hold the die, which
is then wire bonded or tab interconnected to the carrier.
The carrier is unlike normal integrated circuit packages in
that its pins are simply flat pads located on both the top and
bottom of the carrier board. Thus terminals of the die are
accessible from above or below, and wires, if necessary, can
be simply routed through the carrier with no connection to
the die. The carrier board provides a controlled impedance
environment for signals up until the bond to the die. In
addition, the carrier provides low inductance power and
ground plane decoupling capacitance through integral layer
proximity, as well as locations for mounting explicit
ceramic bypass capacitors. Through holes are provided in
the carrier for vertical fluid cooling channels.

Component carriers, together with upper and lower button
board connectors, are placed into a holding frame which
provides two dimensions of horizontal alignment. The
holding frame, with its button boards and carriers, forms a
layer in the stack. Stack-wide printed circuit boards
typically alternate with layers of holding frames and chips,
providing a compact, dense, three dimensional means for
building relatively small (30x30x30 cm) three dimensional
structures.

The logical structure of the four stage radix four omega
network is mapped onto the three-dimensional package by
packaging each stage of the routing network in a separate
layer of the stack. Signal flow through the network is thus
logically in the vertical direction, from one layer to the
next. The omega topology has the valuable property of
having identical wiring patterns between stages of the
network; this property is exploited in the stack by
replicating the interconnection structure of each stage
multiple times. Ideally, then, the stack consists of a
structure alternating a fixed omega wiring permutation of
signals in the horizontal direction with layers of routing
elements. Four such wiring/routing element pairs complete
the three dimensional stack. Figure 4 shows the wiring
pattern for the horizontal wires in one of the layers of the
stack. Each line represents a pair of ports; this figure
shows the wiring for a 64 port network.

The vertical signal flow means that inputs to the switch
structure are available at the top, and that outputs are
available at the bottom. Because we wish to use this
network as a processor to processor communication switch,
the inputs and outputs must be available in physical
proximity. This is solved by routing the network outputs
back through the stack vertically on additional wiring
channels. These channels take up little space, since there is
no horizontal wiring associated with them.


Figure  4 


Providing electrical power to the circuits and removing
waste heat remain significant issues. The fuzz buttons are
excellent, low resistance connectors, and, because of the
large number required between boards for impedance
control, exist in abundance to provide a low inductance
path vertically between boards. Horizontally, power is
provided using integral power-ground plane structures
within the controlled impedance boards. These planes also
provide important low inductance power supply filtering.
Power is brought into the stack with power lugs mounted
on boards at the center (vertically) of the stack which
extend horizontally beyond the normal boundary of the
stack.

Heat is removed from the stack using FC-77 Fluorinert
liquid flowing vertically through the stack. The entire
stack is normally run immersed in Fluorinert, and
pressurized fluid is pumped into a distribution manifold at
the top of the stack. This manifold also acts as one of the
pressure plates which apply compressive force to mate the
large number of button board contacts. The high heat
capacity per unit volume of liquid cooling relative to air
cooling dictated its use in the high density structure.
Modest flow rates (2 gal/min) should be adequate to cool
our prototype system.

As a result of the aggressive packaging used in this design,
the longest wires are approximately 45 centimeters.
Modest cost, easily fabricated PC board materials with a
low dielectric constant, such as Norplex cyanate esters,
have a dielectric constant of 3.1. The wire delay of the
longest paths in the design is thus approximately 2.65
nanoseconds.
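
The relation between wire length, dielectric constant and
delay can be sketched as follows, assuming signals propagate
at c/sqrt(er) as in an ideal, lossless stripline fully
embedded in the dielectric. The function name is chosen here
for illustration.

```python
import math

C_M_PER_NS = 0.299792458   # speed of light in metres per nanosecond


def wire_delay_ns(length_m, dielectric_constant):
    """Propagation delay of a line fully embedded in a uniform dielectric."""
    return length_m * math.sqrt(dielectric_constant) / C_M_PER_NS


delay = wire_delay_ns(0.45, 3.1)   # 45 cm of stripline, dielectric constant 3.1
print(f"{delay:.2f} ns")
```

Under this assumption a 45 centimeter wire in a material of
dielectric constant 3.1 has a delay of a little over 2.6
nanoseconds.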

This  composite  structure  has  many  advantages  over 
conventional  packages.  First,  since  it  is  three  dimensional, 
the  wire  length  for  a  given  wiring  density  is  substantially 
smaller  than  structures  otherwise  achievable  using  two 
dimensional  packaging,  backplanes,  and  cables.  Second,  it 
is  easily  repairable  by  disassembly  of  the  stack,  since  it 
involves  no  soldering  or  other  permanent  connections. 
Third,  though  it  might  seem  awkward  to  debug,  simple 
boards  can  be  constructed  which,  when  added  to  the  stack 
between  particular  layers,  allow  signals  in  that  layer  to  be 
examined. 

Electrical  Issues 

An  early  decision  was  to  totally  abandon  the  idea  of  using 
multi-drop  bus  like  electrical  structures  in  the  design.  The 
drastic  reduction  in  signal  speed  and  line  impedance  due  to 
capacitive  loading  of  the  transmission  lines  in  even 
carefully  engineered  systems  argued  strongly  that  point  to 
point  communications  be  used. 


A dominant electrical design issue was how to drive the
very large number (23,000) of terminated signal wires in
the switch. A 50 ohm impedance level is dictated by
practical wire geometries, and could not, in any case, be
raised by more than a factor of two. With standard CMOS
signal swings of five volts, the parallel termination of a
single wire would dissipate a half watt! We reduce this
power dissipation by a factor of 50 by lowering the signal
swing to one volt, and by series terminating the
transmission lines. The series termination allows the
impedance seen by the output driver to be twice the
impedance of the line, but is applicable only to point-to-
point wiring.
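
The factor of 50 follows from a worst-case static estimate,
P = V^2 / R, with the driver holding the full swing across the
load it sees. A minimal sketch of the arithmetic (the names
are illustrative):

```python
def termination_power_w(swing_v, load_ohm):
    """Power dissipated when a driver holds `swing_v` across `load_ohm`."""
    return swing_v ** 2 / load_ohm


Z0 = 50.0                                      # line impedance in ohms
parallel_5v = termination_power_w(5.0, Z0)     # 5 V swing into a parallel terminator
series_1v = termination_power_w(1.0, 2 * Z0)   # 1 V swing; driver sees twice Z0
ratio = parallel_5v / series_1v
```

Here the five volt, parallel terminated case gives a half watt
per wire, and the one volt, series terminated case gives 10
milliwatts, a ratio of 50.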

The series termination resistance is provided within the
pullup and pulldown transistors of the output driver, as
described in [Knight S3]. Our current design differs a little
from the technique described in that paper in that it uses a
digitally controlled D/A like structure to vary the output
transistor resistance. The use of resistive pullup and
pulldown devices has important speed implications, since
the devices need not (must not) be large devices, and hence
can be driven far more quickly than conventional low
impedance output driver transistors. Providing the
terminating resistors on-chip also has the large advantage
of eliminating 23,000 discrete resistors from the stack, and
allows for electrical compensation of both the driver
impedance and the line impedance against manufacturing
variation.

The one volt logic swing of the output driver is compatible,
in magnitude, with the approximately one volt swing of
ECL logic families. As a result, the use of small quantities
of small scale ECL logic for applications such as clock
buffers and I/O interfacing is practical, using a pair of
offset power supplies for the ECL circuitry.

One of the difficulties we have encountered is the very low
efficiency of one volt power supplies. At these voltages,
the voltage drop of a silicon diode (0.7 volts) becomes a
major source of power supply inefficiency. Synchronously
switched MOS power devices used as rectifiers will solve
this problem, but there is as yet no commercial demand for
this development. As VLSI devices scale to smaller
dimensions, the need for high efficiency, low voltage
power supplies will become very evident.
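
The scale of the diode loss can be illustrated with a simple
bound: if all of the load current passes through a rectifier
that drops a fixed voltage, and every other loss is ignored,
the efficiency cannot exceed Vout / (Vout + Vdrop). This
simplified model and the function name are assumptions made
for illustration.

```python
def rectifier_efficiency(v_out, v_drop):
    """Upper bound on efficiency when the load current also passes through
    a rectifier dropping v_drop volts (all other losses ignored)."""
    return v_out / (v_out + v_drop)


one_volt = rectifier_efficiency(1.0, 0.7)    # one volt supply, silicon diode
five_volt = rectifier_efficiency(5.0, 0.7)   # conventional five volt supply
```

Under this model a one volt supply is limited to roughly 59
percent efficiency, against roughly 88 percent for a five volt
supply with the same diode.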

We are currently investigating two techniques for clock
distribution. The conventional approach is to use multi-stage
clock fanout with equal length and matched delay
transmission lines to each network element for delivery of
a time aligned clock signal. A second approach of creating
the clock signal as a single node, wired in a highly
interconnected three-dimensional grid may offer some
advantage. The grid must be driven at multiple locations
(perhaps every 5-10 cm in all axes) and treated as a lumped
capacitive load. Since the clock waveform is of a single
frequency, we can consider the possibility of resonating
this capacitance with a tuned inductance to reduce clock
distribution power.
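
The tuned inductance follows from the standard resonance
condition f = 1 / (2 pi sqrt(L C)). In the sketch below the
clock frequency and grid capacitance are hypothetical values
chosen only to exercise the formula; the text gives neither.

```python
import math


def resonant_inductance_h(clock_hz, grid_capacitance_f):
    """Inductance that resonates with the grid capacitance at the clock frequency."""
    return 1.0 / ((2 * math.pi * clock_hz) ** 2 * grid_capacitance_f)


def resonant_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC circuit."""
    return 1.0 / (2 * math.pi * math.sqrt(inductance_h * capacitance_f))


# Hypothetical values: 100 MHz clock, 1 nF of lumped grid capacitance.
tuned = resonant_inductance_h(100e6, 1e-9)
```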

Performance 

One of the advantages of the unbuffered style of
communication network is ease of performance analysis.
Since the network timing is determined entirely by the
pipeline delay, the latency for successful messages is easy
to calculate. Since the system is memoryless except at the
sender, the probability of routing success within the
network can be calculated quite easily, using the analytic
techniques described in [Knight 89].

A typical message might consist of a remote memory read
access. Such a request would send an address forward
through the Transit network, cycle the remote memory, and
return an acknowledgment and the read data. For 32 bit
address and data, the forward message is five bytes long,
and the reverse message is five bytes long. A two byte
checksum will likely be added to these message lengths,
although these are indistinguishable from data to the
network. The pipeline delay of the network is four clocks,
so the remote access is complete in 7+4+7+4 = 22 cycles.
By making optimistic assumptions about the success of
checksumming the data, we can overlap a portion of the
forward message delivery with the cycling of the remote
memory system. As soon as two bytes of address are
received, we can initiate a RAS cycle on the remote memory
system, and start the memory cycle in parallel with receipt
of the remainder of the address. Similarly, the
acknowledgment byte may be sent prior to having access to
the read data. This gives 60 nanoseconds at the remote
processor/memory pair to perform a memory RAS/CAS
cycle and obtain the data.
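
The cycle count can be reproduced with a short sketch. The
breakdown of each five byte message into one routing or
acknowledgment byte plus four bytes of address or data is our
reading of the text rather than something it states, and the
function name is hypothetical.

```python
def remote_read_cycles(addr_bytes=4, data_bytes=4, checksum_bytes=2, stages=4):
    """Clock cycles for a remote read with no overlap of memory access."""
    forward = 1 + addr_bytes + checksum_bytes   # routing byte + address + checksum
    reverse = 1 + data_bytes + checksum_bytes   # acknowledgment + data + checksum
    # Each direction also pays the pipeline delay of one clock per stage.
    return forward + stages + reverse + stages


cycles = remote_read_cycles()
```

With the default arguments this evaluates 7+4+7+4 and gives 22
cycles.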

The probability of successfully routing through the Transit
network as a function of input loading is shown in figures 5
and 6. The input loading is the probability that an input
port has a message being sent or received. The best that
can be achieved without combining approaches is the
non-blocking behavior of the crossbar. Figure 5 shows the
performance of a crossbar network with one output port to
each logical destination. For comparison, the eight stage
omega network constructed out of 2 by 2 switch elements,
and the four stage omega network constructed out of 4 x 4
switch elements are also shown. The Transit network,
further limited to a single output port per logical
destination, is shown on this same graph. The extra output
ports between switching elements lead to behavior very
close to the ideal behavior of the crossbar.

Figure 5

Similarly, in figure 6, we show the ideal behavior of a large
non-blocking n x 2n crossbar which allows two output
ports per logical destination. Below it we show the
performance of the Transit network, again demonstrating
performance close to the behavior of a crossbar.

The reason for this good performance lies in the choice of
network element, particularly in the availability of
multiple output paths travelling in a single logical
direction. The performance of the network from a
probabilistic standpoint could be improved yet more by
constructing a switching element with eight inputs and two
clusters of four outputs each, where each of the four ports
in a cluster travelled in a logically equivalent direction.
The disadvantage of this approach is the doubling of the
number of stages in the network, since only one bit worth
of routing is performed per stage of the network. The
choice of the element for Transit was dictated by a desire
to minimize the pipeline delay of the network while
maintaining good probabilistic performance.
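
The trade-off between element types can be explored with a
simple analytic model. The sketch below is a textbook-style
approximation, not necessarily the analysis cited above: it
assumes uniformly random destinations, independent inputs at
every stage, no retries, and it ignores the special wiring of
the final stage. All names are hypothetical.

```python
from math import comb


def stage_output_load(p, inputs, directions, dilation):
    """Per-port output load of one element given per-port input load p."""
    q = p / directions                       # chance an input wants a given direction
    expected = 0.0
    for k in range(inputs + 1):              # k = requests for that direction
        prob = comb(inputs, k) * q ** k * (1 - q) ** (inputs - k)
        expected += prob * min(k, dilation)  # at most `dilation` requests get a port
    return expected / dilation


def success_probability(p, inputs, directions, dilation, stages):
    """Fraction of offered messages that survive every stage."""
    load = p
    for _ in range(stages):
        load = stage_output_load(load, inputs, directions, dilation)
    return load / p


transit = success_probability(0.5, 8, 4, 2, 4)   # 8 in, 4 directions x 2 ports
plain4 = success_probability(0.5, 4, 4, 1, 4)    # 4 x 4 elements, four stages
plain2 = success_probability(0.5, 2, 2, 1, 8)    # 2 x 2 elements, eight stages
wide = success_probability(0.5, 8, 2, 4, 8)      # 8 in, 2 clusters x 4 ports
```

Under these assumptions, at an input loading of 0.5 the model
ranks the configurations in the order the text describes: the
paired-port element outperforms both undilated omega networks,
and the element with two clusters of four ports does better
still at the cost of twice as many stages.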

Technology  Extrapolation  and  Limits 

The approach of constructing large multi-stage omega
networks becomes infeasible at a point not much larger
than the network we are constructing, due to the
exponential growth of wiring. For processor networks
larger than can be packaged with short wiring, the architect
(and ultimately the programmer) must face the importance
of locality in constructing very large parallel machines.
Perhaps the most elegant approach to acknowledging the
necessity for this locality is the fat-tree [Leiserson 85]
approach. A fat-tree can be thought of as a multi-stage
omega network where local transactions are successively
isolated from more global transactions. The more global
transactions are routed, through successively more narrow
channels, towards a global switching array. Finally, they
arrive at the most global (root) node of the network, and,
from there, may be delivered to any location. The
narrowing of the channels as the root is approached allows
this network to scale to very large arrays, at the cost of
latency, and of limited ability to communicate globally.

We can construct fat-tree based routing networks from the
stack structure described above for Transit by adding one
additional routing stage per stack. The purpose of this
routing stage is to isolate messages destined for more
global stages of the switch from those that may be
delivered locally. The more global messages are routed to
the bottom of the stack, where they connect to a set of
flexible printed circuit board layers used as cabling
between stacks. The other end of these flexible PC board
cables is routed to the top of another stack, along with the
global signals from three additional stacks. Outputs of the
global stack similarly are channeled back to the local
stacks. This approach of constructing a fat-tree like
structure from a tree of high performance routing stacks
appears to be an effective way of building networks which
combine high performance, an ability to take advantage of
locality, and scalability to tens of thousands of high
performance processors.

Two alternative electrical techniques for communicating
between routing elements appear to be important. One is
the approach of Rettberg, Glasser and Basset
[Rettberg 87] for eliminating the reliance on low clock
skew in the signal paths. Future versions of the Transit
network will likely require an approach similar to this,
especially if the wiring between stacks is long enough to
impose delays large compared to the anticipated clock rate.

Another approach which we are devoting some attention to
is the notion of transmitting data between chips by use of
modulated microwave carriers. The advantage of this
scheme is the elimination of the DC component of the
digital signal, transforming a broadband digital signal into
a narrowband RF signal. For the same reasons that
modems are an appropriate technique for transmitting data
on long distance telephone lines, the use of narrowband
data transmission allows many electrical tricks which are
otherwise not available. Transformers, power splitters,
limiters, stub tuning of transmission lines, and automatic
gain controls can all be used to good advantage in
communicating these signals. The high dispersion of
transmission lines associated with the series resistance of
the line is a much smaller problem when the range of
frequencies is less than an octave. Finally, and perhaps
most compelling, the connection of signals from one
physical structure to another need not be done with wires,
but may be done with the intrinsic capacitance of adjacent
metal contacts. A chip, for example, might not need bond
wires for the signals, but only for power and ground
distribution.

Summary 

We have presented an initial engineering design for a high
performance processor to processor interconnection switch
intended as the substrate for a programming model
independent computer architecture. Some of the key
elements of this approach have already been tested in
prototype form, and we are actively pursuing a complete
prototype.